Apache Spark is widely considered one of the highest-performing data processing engines.
However, efficiency is not its only claim to fame. Since version 2.3 of Spark (and, arguably, since 2.2.1), it’s possible to load data into a fault-tolerant structured stream with more features than ever. This is great news: as demand for actionable data insights increases, we have a real-time streaming solution that handles many of the messy details for us, so we can focus on extracting actionable items and creating products that benefit from that “real-timeness”.
Personally, what I like about Spark is that I don’t have to learn Scala if I don’t want to, thanks to the Python API, PySpark.
from pyspark.sql.functions import *
But we’ll get more into that later.
For now, let’s try to imagine a scenario where we need:
- high efficiency (lots of data),
- fault-tolerance (production use-case),
- real-time data throughput,
- and a quick setup using tools you and your team are comfortable with.
That might sound like a tall order (yet a realistic one), but we’re going to find out how easy it can be to configure with Spark and Amazon Web Services (provided you’re comfortable with MongoDB as well as SQL).
As you’ve probably guessed, there is no point in a data pipeline without data. Luckily, Kaggle has some interesting datasets we can download and play with for free. Here’s the one I’ll be using for this scenario.
The reasons I chose cellular phone sensor data are:
- we have a lot of features, which is typical in a real-world scenario
- it’s going to be interesting to see if we can use this data to predict how someone is positioned
I should admit: I get excited about the data and forget that the most important first step is getting your environment set up. In the real world, you’ll need (and have) the time to understand your data first.
I’ll be using Ubuntu 16.04, Spark 2.2.1, the latest PySpark version, Kafka as a messenger, and an AWS Kinesis cluster. That’s a mouthful, but we’ll break down each component to better understand what we’re doing in this scenario and why.
- Ubuntu, because it’s tried-and-true.
- Spark 2.2.1, which supports structured streaming (2.3 is the newest version as of this writing)
- PySpark, to connect to Spark using Python (let’s pretend our team knows Python best)
- AWS Kinesis, so we can persist the data and make it available to other services
- Kafka as a messenger between MongoDB and Spark
There are lots of ways to begin using Spark. AWS even provisions EMR (Elastic MapReduce) instances with Spark preinstalled, but since we’re going to be adding a third-party MongoDB connector to our Spark installation, we’ll be doing everything locally today.
Spark depends on Scala, which runs on the Java Virtual Machine. So, assuming you have a fresh Ubuntu installation, we need the default JDK first:
sudo apt-get update && sudo apt-get install default-jdk
Then install Scala:
sudo apt-get install scala
To make sure it works, start the REPL by running scala, then print “hello world”:
println("hello world")
then quit with
:quit
Great! Now let’s install Spark. Feel free to head over to the official Spark downloads page, but for brevity, get 2.2.1 and extract it using the following commands:
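For example (the mirror path and Hadoop build below are assumptions — grab whichever build the downloads page offers):

```shell
# Download the 2.2.1 release (Hadoop 2.7 build) from the Apache archive
wget https://archive.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz

# Extract it and launch the Spark shell from the extracted directory
tar -xzf spark-2.2.1-bin-hadoop2.7.tgz
cd spark-2.2.1-bin-hadoop2.7
./bin/spark-shell
```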
At this point, Spark’s shell should pop up! Go ahead and test that Scala and Spark are working:
>println("Spark is working!")
Got it? Awesome!
Note: Ubuntu has Python 3 installed by default, but it’s not the default Python (the python command still points to Python 2).
Now, grab the packages we need:
python3 -m pip install pyspark
python3 -m pip install kafka-python
That last pip command brings us to our next installation requirement, which is Kafka.
Kafka is a messaging solution that allows us to notify streams (and other consumers) of events. Our event will be the submission of data, and that data will be in the body of the message. In a production environment, you might have a MongoDB database with a Kafka producer that fires whenever information is stored, sending it out to be processed or stored in other infrastructure.
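To make that concrete, here’s a minimal sketch of publishing a sensor reading as a Kafka message using the kafka-python package. The helper function and the topic name `sensor-data` are assumptions for illustration, not part of the dataset or any fixed schema:

```python
import json

# Hypothetical helper: serialize one sensor reading as a Kafka message body.
def encode_reading(reading):
    """Encode a dict of sensor fields as UTF-8 JSON bytes."""
    return json.dumps(reading, sort_keys=True).encode("utf-8")

# With a broker running locally (and kafka-python installed), publishing a
# reading would look like this -- commented out since it needs a live broker:
#
#   from kafka import KafkaProducer
#   producer = KafkaProducer(bootstrap_servers="localhost:9092")
#   producer.send("sensor-data", encode_reading({"attitude_roll": 0.12}))
#   producer.flush()

print(encode_reading({"attitude_roll": 0.12}))
```

Keeping the message body as plain JSON makes it easy for any downstream consumer (Spark included) to deserialize it.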
Installation is straightforward, but it requires Java and a few other pieces:
sudo apt-get update -y
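The rest of the sequence might look like this (the Java PPA and the Kafka version/paths are assumptions — check for current releases):

```shell
# Add a repository providing Java, install it, and confirm it worked
sudo add-apt-repository -y ppa:webupd8team/java
sudo apt-get update -y
sudo apt-get install -y oracle-java8-installer
java -version

# Zookeeper, the coordination/config service Kafka depends on
sudo apt-get install -y zookeeperd

# Download Kafka, make a directory in /opt just for it, extract, and start
# the server using the basic config
wget https://archive.apache.org/dist/kafka/1.0.0/kafka_2.11-1.0.0.tgz
sudo mkdir /opt/kafka
sudo tar -xzf kafka_2.11-1.0.0.tgz -C /opt/kafka --strip-components=1
sudo /opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
```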
Whew! So what did we just do there?
- We added a repository to obtain Java, and then we confirmed the installation worked.
- We installed Zookeeper, a configuration/coordination service that Kafka depends on (it’s easy, don’t worry)
- We downloaded Kafka, made a directory in /opt just for it, extracted the archive, and started the server using the basic config
Done! Now it’s time to work with our data.
So, from here on out, I’ll be using Jupyter Notebook. I think it’s a great tool for data scientists who need to rapidly communicate and share ETL processes (let’s be honest, 90% of the job is ETL and cleaning), algorithms, and even thought processes.
Let’s check out what PySpark can do now with our configured environment.
Awesome! We successfully started a Spark session, gave it a name, and read the CSV file we downloaded from Kaggle.
At the end, we performed some basic transformation.
But we’re not doing it as it arrives… we’re doing it with a local file!
So the next step is to create our mock data producer, read from the database we’re inserting live data into, and finally pick that data up with Spark to send off to the cloud after any ETL we want to do.
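As a preview, a mock producer could be as simple as a generator of fake sensor readings (the field names here are hypothetical stand-ins for the dataset’s real columns):

```python
import random
import time

# Hypothetical mock producer: yields fake sensor readings resembling the
# cellular phone sensor data (field names are assumptions, not the real schema).
def mock_readings(n, seed=None):
    rng = random.Random(seed)
    activities = ["sitting", "standing", "walking"]
    for _ in range(n):
        yield {
            "ts": time.time(),
            "attitude_roll": rng.uniform(-1.0, 1.0),
            "attitude_pitch": rng.uniform(-1.0, 1.0),
            "activity": rng.choice(activities),
        }

# Each reading could then be inserted into MongoDB (e.g. with pymongo) or
# published straight to a Kafka topic for Spark to pick up.
for reading in mock_readings(3, seed=42):
    print(reading)
```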
In part 2, I will demonstrate how we take our data, process it in a messenger stream with Kafka, and begin the exciting part where we perform some machine learning/analysis with Spark — and later push it through a Kinesis stream to Elasticsearch and S3!