When it comes to big data storage and processing, two of the most widely used tools are Hadoop and Apache Spark. Hadoop is a distributed storage and processing framework that uses MapReduce for computation. One of Hadoop's limitations is the speed at which it executes big data jobs, and Spark was created to address this problem. Apache Spark is an open source, distributed, in-memory big data processing engine. Spark has no storage layer of its own; it commonly relies on Hadoop (HDFS) for storage, and for processing it can run either on Hadoop or on its own cluster manager. The main advantage of Apache Spark is faster processing of jobs. Spark lets us run SQL, perform stream processing and real-time analytics, develop robust machine learning models and perform graph processing. In this post we will briefly introduce Apache Spark: its benefits, components, data structures, deployment options, programming languages, streaming techniques and sample use-cases.
Benefits of Apache Spark
- Speed. The major benefit of Apache Spark in big data processing is speed. Spark uses in-memory processing, caching and optimized query execution to process big data workloads faster, which is especially useful in stream processing and real-time analytics.
- Support for different programming languages. Spark enables us to write big data programs in different languages such as Java, Scala, Python and R.
- Robust components for multiple workloads. Apache Spark provides different tools for working with data, such as Spark SQL for querying data, Spark Streaming for real-time processing, machine learning libraries and graph-processing utilities.
- Large and Active community.
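- Fault tolerance. Apache Spark is a distributed big data processing engine that provides fault tolerance through RDDs, which can rebuild lost partitions from the recorded sequence of transformations (the lineage) that produced them.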
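To make the fault-tolerance point more concrete, here is a minimal sketch in plain Python of how a dataset might recover a lost partition by replaying its lineage. This is a toy model and not the Spark API: the class `ToyLineageDataset` and its methods are hypothetical names made up for illustration, and it assumes the source partitions live in durable storage that survives the failure.

```python
from typing import Callable, Dict, List


class ToyLineageDataset:
    """Toy stand-in for an RDD: source partitions plus a recorded lineage."""

    def __init__(self, source_partitions: List[List[int]]):
        # Assumption: the source data is durable (for example, files on HDFS).
        self.source_partitions = source_partitions
        # Ordered list of functions applied to every element.
        self.lineage: List[Callable[[int], int]] = []
        # Computed partitions held in memory; these can be lost.
        self.materialized: Dict[int, List[int]] = {}

    def map(self, fn: Callable[[int], int]) -> "ToyLineageDataset":
        # Return a new dataset (immutability) whose lineage has one more step.
        new = ToyLineageDataset(self.source_partitions)
        new.lineage = self.lineage + [fn]
        return new

    def compute_partition(self, index: int) -> List[int]:
        # Replay the whole lineage on one source partition only.
        rows = self.source_partitions[index]
        for fn in self.lineage:
            rows = [fn(r) for r in rows]
        self.materialized[index] = rows
        return rows

    def get_partition(self, index: int) -> List[int]:
        # Use the in-memory copy if present, otherwise recompute it.
        if index not in self.materialized:
            return self.compute_partition(index)
        return self.materialized[index]

    def lose_partition(self, index: int) -> None:
        # Simulate a worker failure that drops one computed partition.
        self.materialized.pop(index, None)


dataset = ToyLineageDataset([[1, 2], [3, 4]]).map(lambda x: x * 10).map(lambda x: x + 1)
before = [dataset.get_partition(i) for i in range(2)]

dataset.lose_partition(1)              # partition 1 disappears from memory
recovered = dataset.get_partition(1)   # rebuilt from source + lineage
print(before, recovered)
```

Only the lost partition is recomputed; partition 0 stays in memory untouched. Real Spark tracks lineage across a graph of dependent RDDs, which this sketch does not attempt to model.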
Components of Apache Spark
- Spark Core. This is the foundation engine of the Spark platform, on top of which the high-level APIs and other Spark services are built.
- Spark SQL. This is the SQL component residing on top of Spark Core. It provides a structured data abstraction and the capability to run interactive queries.
- Spark MLlib (Machine Learning Library). This is a distributed machine learning framework for Spark.
- Spark Streaming. This component uses Spark Core's fast scheduling to process streams of data. It’s useful in real-time analytics use-cases.
- GraphX. This is a distributed graph-processing framework.
Apache Spark Data Structure
Apache Spark supports three main data structures, namely RDDs, DataFrames and Datasets. We can easily move between these three data structures with a simple API call.
- Resilient Distributed Dataset (RDD). This is an immutable, distributed collection of data elements. RDDs are the fundamental data structure in Spark and are useful for low-level transformations and actions. RDDs do not have a built-in optimizer.
- DataFrame. This is an immutable, distributed collection of data organized into named columns, like a table in a relational database. It uses the Catalyst optimizer to optimize workloads.
- Dataset. This is a strongly-typed, immutable collection of objects mapped to a table-like structure. Compared to RDDs, Datasets generally perform faster and use less memory. Apart from a strongly-typed API, the Dataset also supports an untyped API. The Dataset is an extension of the DataFrame that adds the advantages of RDDs, such as type safety. The typed Dataset API is available in Scala and Java; Python and R work with DataFrames.
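The "transformations and actions" mentioned for RDDs can be illustrated with a small plain-Python sketch. It is a toy model rather than Spark code: `ToyLazyCollection` is a hypothetical class, and the only thing it tries to show is that transformations (`map`, `filter`) just record work on a new immutable object, while an action (`collect`, `count`) actually runs it.

```python
from typing import Any, Callable, Iterable, List, Optional, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class ToyLazyCollection:
    """Toy immutable collection with lazy transformations and eager actions."""

    def __init__(self, data: Iterable[Any], steps: Optional[List[Step]] = None):
        self._data = list(data)
        self._steps: List[Step] = steps or []

    # --- transformations: record the step, return a NEW collection ---
    def map(self, fn: Callable[[Any], Any]) -> "ToyLazyCollection":
        return ToyLazyCollection(self._data, self._steps + [("map", fn)])

    def filter(self, pred: Callable[[Any], bool]) -> "ToyLazyCollection":
        return ToyLazyCollection(self._data, self._steps + [("filter", pred)])

    # --- actions: run every recorded step and return a result ---
    def collect(self) -> List[Any]:
        rows = self._data
        for kind, fn in self._steps:
            if kind == "map":
                rows = [fn(r) for r in rows]
            else:
                rows = [r for r in rows if fn(r)]
        return rows

    def count(self) -> int:
        return len(self.collect())


calls = []  # records every element that `double` is applied to


def double(x: int) -> int:
    calls.append(x)
    return x * 2


numbers = ToyLazyCollection([1, 2, 3, 4, 5])
pipeline = numbers.map(double).filter(lambda x: x > 4)

calls_before_action = len(calls)   # nothing has executed yet
result = pipeline.collect()        # the action triggers the work
calls_after_action = len(calls)
original = numbers.collect()       # the source collection is unchanged
print(calls_before_action, result, calls_after_action, original)
```

Because the steps are only recorded until an action is called, an engine has the chance to inspect the whole pipeline before running it. Spark's Catalyst optimizer uses that opportunity for DataFrames and Datasets; this sketch performs no optimization at all.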
Apache Spark Deployment
Although Hadoop is the foundation of big data storage and processing, Spark can either leverage Hadoop or run on a different cluster manager. Below are the deployment options:
- Standalone. Spark can use its built-in standalone cluster manager to execute its workloads.
- Hadoop YARN. In this case Spark runs on Hadoop, using YARN (Yet Another Resource Negotiator) as the cluster resource manager.
- Apache Mesos. This option deploys Spark on a private cluster managed by Apache Mesos. It’s deprecated as of Spark version 3.2.0.
- Kubernetes. This is an open source system for automating the deployment, scaling and management of containerized applications, which Spark can use as its cluster manager.
Spark Modes of Deployment
Spark deployment modes refer to where the driver program runs when a job is submitted. There are two main modes of Spark deployment: cluster mode and client mode.
- Spark Cluster Mode. The driver program does not run on the local/host machine where the job is submitted, but rather inside the cluster.
- Spark Client Mode. The driver program runs on the host machine where the job is submitted.
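In practice the deployment mode is chosen when the job is submitted with `spark-submit`, through its `--deploy-mode` option, while `--master` selects the cluster manager. The short Python sketch below only assembles the command line so the difference between the two modes is visible. The helper `build_spark_submit` and the script name `my_job.py` are hypothetical, and the sketch assumes a YARN cluster.

```python
from typing import List


def build_spark_submit(app_path: str, master: str, deploy_mode: str) -> List[str]:
    """Assemble a spark-submit command line (hypothetical helper)."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", master,            # which cluster manager to use
        "--deploy-mode", deploy_mode,  # where the driver program runs
        app_path,
    ]


# Driver runs inside the cluster.
cluster_cmd = build_spark_submit("my_job.py", "yarn", "cluster")
# Driver runs on the machine that submits the job.
client_cmd = build_spark_submit("my_job.py", "yarn", "client")

print(" ".join(cluster_cmd))
print(" ".join(client_cmd))
```

On a machine with Spark installed and a reachable cluster, the resulting list could be passed to `subprocess.run` to actually submit the job.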
Apache Spark Programming Languages
Apache Spark makes big data easier to work with. It supports the programming languages below, which gives us flexibility in big data computing.
- Scala. Spark itself is written in Scala. Scala code runs directly on the JVM, so low-level Spark code in Scala generally runs faster than the equivalent Python code.
- Java. Java can also be used for writing Spark programs, although the code tends to be more verbose.
- Python. Python is a flexible language that is easy to use and understand. It’s more heavily used in Machine Learning and Data Science workloads than the other languages.
- R. This is a statistical programming language.
One reason for Apache Spark's popularity is its ability to process data quickly and in near real time. There are two techniques used for stream processing in Spark: Spark Streaming and Structured Streaming.
- Spark Streaming. Spark Streaming is a framework on top of the core Spark API that provides scalable and fault-tolerant stream processing. It uses a micro-batch technique for streaming. Spark Streaming uses the DStream (discretized stream), a high-level API for representing a continuous stream of data from different data sources, e.g. Kafka. A DStream is represented by a continuous series of RDDs.
- Structured Streaming. Structured Streaming provides scalable, high-throughput and fault-tolerant stream processing. It runs on top of the Spark SQL engine. Structured Streaming checks for new data based on a trigger interval, which can bring it closer to real time than Spark Streaming. It uses DataFrames and Datasets, which are optimized, unlike Spark Streaming, whose DStreams lie on top of RDDs that are not optimized internally.
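The micro-batch idea can be sketched in a few lines of plain Python. This is a toy illustration and not Spark code: the function `micro_batches` and the sample events are hypothetical, and timestamps are assumed to be non-negative seconds. Each inner list plays the role of one small batch in the series that makes up a discretized stream.

```python
from typing import Dict, List, Tuple


def micro_batches(events: List[Tuple[float, str]], batch_interval: float) -> List[List[str]]:
    """Group (timestamp, value) events into consecutive fixed-width batches."""
    if batch_interval <= 0:
        raise ValueError("batch_interval must be positive")
    if not events:
        return []
    buckets: Dict[int, List[str]] = {}
    for timestamp, value in events:
        # An event belongs to the batch whose time window contains it.
        batch_index = int(timestamp // batch_interval)
        buckets.setdefault(batch_index, []).append(value)
    # Emit every interval up to the last one, including empty batches.
    last_index = max(buckets)
    return [buckets.get(i, []) for i in range(last_index + 1)]


# Hypothetical stream of (seconds since start, payload) events.
events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
batches = micro_batches(events, batch_interval=1.0)
batch_sizes = [len(batch) for batch in batches]
print(batches, batch_sizes)
```

A shorter interval produces smaller batches and lower latency at the cost of more scheduling overhead, which is the trade-off behind choosing a batch or trigger interval.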
Apache Spark Use-Cases
Apache Spark has many use-cases, for example:
- Predicting customer churn in banks.
- Recommending and predicting patient treatment.
- Recommending products to customers on e-commerce platforms.
- Real time recommendation of videos to users.
In this post we have briefly introduced Apache Spark. We have looked at its components, benefits, data structures and deployment strategies. Apache Spark is a broad and complex big data technology and we have just scratched the surface. You can refer to the Apache Spark documentation for more details. In the next post we will look at how to get started with Apache Spark using Python. For a brief overview of big data, check our previous post here.