Spark is a popular open source, distributed, in-memory engine for big data processing. It’s loved for its speed and for its support of several programming languages, including Scala, Java, Python and R. Spark was originally written in Scala, but the most commonly used programming language in data science, machine learning and analytics is Python. PySpark was introduced so that data practitioners can use Spark from Python, which speeds up big data processing work for users who already know the language. In this post we will learn about key concepts in Spark, how Spark executes an application, and how to get started with Spark in Python through PySpark.
Key Concepts in Spark
- Application. This is a complete computation made up of a driver process and executor processes that runs user-supplied code to produce results.
- Cluster Manager. This is a service that acquires and allocates resources on the worker nodes so that Spark applications can run. It can be Spark’s Standalone manager, Hadoop YARN, Apache Mesos or Kubernetes.
- SparkSession. This is the unified entry point to a Spark application. It’s one of the first objects we create when developing a Spark application, and it lets us start working with RDDs, DataFrames and Datasets.
- Worker Node. This is a node that runs application code in the cluster. It receives tasks and instructions from the master node and executes them. It’s also called a slave node. A worker node performs data processing, reads and writes data to external files/sources, and stores computation results in memory or in storage locations.
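- Task. This is a unit of work that is sent to one executor.
- Job. This is a parallel computation made up of multiple tasks. It is started when an action such as collect() or count() is called.
- Stage. This is a smaller set of tasks within a job. A job is divided into stages that depend on each other.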
- SparkContext. This is the entry point to Spark programming with RDDs.
- Driver Program. This is the process that runs on the master node, executes the application’s main() function and creates the SparkContext.
- Executor. This is a process launched on a worker node that is responsible for running individual tasks in a Spark job.
Spark Execution Flow
The Spark execution process starts when the client submits an application to the resource manager. The resource manager launches the application master, which runs the driver program. The driver program creates a SparkContext that requests resources from the resource manager. A signal is then sent to the worker nodes, which in turn launch the Spark executors. The executors register with the driver program in the application master. Once the executors are registered, the driver program launches tasks in the executors on the worker nodes.
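The last step of this flow, where the driver hands tasks to executors and combines their results, can be pictured with a toy model in plain Python. This is only an analogy under stated assumptions: a thread pool stands in for the executors, and the function names (split_into_partitions, count_rows_task, run_job) are made up for illustration. It is not how Spark is implemented.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, num_partitions):
    """Driver side: cut the input into chunks, one chunk per task."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def count_rows_task(partition):
    """Executor side: the work done for a single partition (one task)."""
    return len(partition)


def run_job(rows, num_executors=2, num_partitions=4):
    """Toy driver: create tasks, hand them to executors, combine results."""
    partitions = split_into_partitions(rows, num_partitions)
    # The thread pool is a stand-in for executors registered with the driver.
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        partial_counts = list(executors.map(count_rows_task, partitions))
    return sum(partial_counts)


countries = ["France", "India", "Kenya", "Nigeria", "China", "USA", "UK"]
print(run_job(countries))  # 7
```

In real Spark the partitions live on different machines, and the cluster manager decides where the executors run. The shape of the work is similar, though: split, run tasks in parallel, combine.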
There are different ways to install and get started with Spark. Check the Spark documentation for the full installation process. For working with Python, we can install PySpark with the pip command below:
pip install pyspark
As of today, the pip-installed PySpark package is meant for connecting to an existing cluster (Spark standalone, YARN or Mesos); it does not include the tools needed to set up your own standalone Spark cluster. PySpark depends on Py4J, and some of its features need additional Python packages.
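After running pip, you may want to confirm that PySpark and its Py4J dependency are visible to your Python environment. The following is a small sketch using only the standard library (Python 3.8 or later); the helper name installed_version is made up for this example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("pyspark", "py4j"):
    version = installed_version(name)
    print(name, version if version else "not installed")
```

If either package is reported as not installed, rerun the pip command in the same environment that your notebook uses.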
To install Spark on Windows follow the steps in this link.
Spark can also be used as a hosted platform within an organization. Spark provides hosted notebooks in mybinder Cloud that you can use to run Spark applications without going through the setup process.
Testing PySpark in Jupyter
Now that you have set up your Spark cluster and configured Jupyter Notebook, it’s time to test Spark in Python.
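Import findspark to use Spark in Jupyter Notebook
import findspark
# Locate the local Spark installation and make pyspark importable
findspark.init()
from pyspark import SparkContext
# Reuse a running SparkContext or start a new one; the RDD step uses it as sc
sc = SparkContext.getOrCreate()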
Create Resilient Distributed Dataset (RDD)
data = sc.parallelize([
    ['France', '50M', '3T'], ['India', '30M', '30T'], ['Kenya', '70M', '25T'],
    ['Nigeria', '90M', '60T'], ['China', '20M', '2T'], ['USA', '80M', '30T'],
    ['UK', '70M', '25T'], ['USA', '20M', '30T'], ['China', '70M', '25T'],
    ['France', '50M', '3T'], ['China', '70M', '25T'],
])
data
Show RDD Data
For complete code check the notebook here.
In this post we have looked at getting started with Spark in Python through PySpark. We have looked at key concepts in Spark and how it executes. We have also written a simple Spark program in Python. In the next post we will look at Spark RDD with PySpark. To learn about Spark and why it’s highly recommended for big data processing and analytics check our previous post here.