Pandas is a powerful Python data analysis library that makes working with data easier. It comes with a wealth of functions for manipulating data, has become the standard data analysis tool in Python, and uses the DataFrame as its core data structure. Spark's DataFrames, in turn, are highly optimized for big data workloads, and to integrate with Pandas, Spark provides a Pandas API through PySpark. Users with Pandas knowledge can now benefit from Spark's speed and its ability to work with larger datasets with ease, which is a huge advantage for developers coming from Pandas. In this post we will look at some of the basic Pandas-on-Spark commands. Download the data for this post from here.

Pandas on Spark with PySpark

Load required libraries


import pyspark.pandas as ps
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Create Pandas on Spark DataFrame


ps_df=ps.DataFrame([['France','50M','3T'],['India','30M','30T'],['Kenya','70M','25T'],
                     ['Nigeria','90M','60T'],['China','20M','2T'],['USA','80M','30T'],
                     ['UK','70M','25T'],['USA','20M','30T'],['China','70M','25T'],
                     ['France', '50M', '3T'],['China','70M','25T'] ],
                          columns=['Country','Population','GDP'])
ps_df

[Output: the Pandas-on-Spark DataFrame]


salary_df = ps.DataFrame(
    {
        "Department": ["Finance","Technology","Finance","Technology","Technology"],
        "Staff": ["Tom", "Peter","Simon", "Mary", "Jane"],
        "Salary": [90000.00, 57000.00,40000.00, 34000.00, 12000.00]
    },
)

salary_df

[Output: the salary DataFrame]

Create Spark DataFrame


spark = SparkSession.builder.getOrCreate()
sdf=spark.createDataFrame([['France','50M','3T'],['India','30M','30T'],['Kenya','70M','25T'],
                     ['Nigeria','90M','60T'],['China','20M','2T'],['USA','80M','30T'],
                     ['UK','70M','25T'],['USA','20M','30T'],['China','70M','25T'],
                     ['France', '50M', '3T'],['China','70M','25T'] ],
                          schema='Country string,Population string,GDP string')
sdf.show()

[Output: the Spark DataFrame]

Create Pandas DataFrame


pd_df=pd.DataFrame([['France','50M','3T'],['India','30M','30T'],['Kenya','70M','25T'],
                     ['Nigeria','90M','60T'],['China','20M','2T'],['USA','80M','30T'],
                     ['UK','70M','25T'],['USA','20M','30T'],['China','70M','25T'],
                     ['France', '50M', '3T'],['China','70M','25T'] ],
                          columns=['Country','Population','GDP'])
pd_df

[Output: the Pandas DataFrame]

Read an external CSV file with Pandas-on-Spark


df=ps.read_csv("titanic.csv")
df.head()

Convert a Pandas-on-Spark DataFrame to a Pandas DataFrame


ps_to_pd_df=ps_df.to_pandas()
ps_to_pd_df

[Output: the converted Pandas DataFrame]

Convert Pandas DataFrame to Pandas-on-Spark DataFrame


pd_to_ps_df=ps.from_pandas(pd_df)
pd_to_ps_df

[Output: the converted Pandas-on-Spark DataFrame]

Convert Pandas DataFrame to Spark DataFrame


pd_to_sdf=spark.createDataFrame(pd_df)
pd_to_sdf.show()

[Output: the converted Spark DataFrame]

Pandas on Spark Functions

[Output: the salary DataFrame used in the examples below]

Check rows and columns
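The shape shown below comes from the DataFrame's `shape` attribute. A minimal self-contained sketch, shown here with plain pandas since `pyspark.pandas` exposes the identical attribute on `salary_df`:

```python
import pandas as pd  # pyspark.pandas (ps) mirrors this API; swap pd for ps on Spark

# Rebuild the salary DataFrame from the post
salary_df = pd.DataFrame({
    "Department": ["Finance", "Technology", "Finance", "Technology", "Technology"],
    "Staff": ["Tom", "Peter", "Simon", "Mary", "Jane"],
    "Salary": [90000.0, 57000.0, 40000.0, 34000.0, 12000.0],
})
print(salary_df.shape)  # (5, 3) -> 5 rows, 3 columns
```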

(5, 3)

Check DataFrame types
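The column types come from the `dtypes` attribute; a plain-pandas sketch (the `pyspark.pandas` call is identical):

```python
import pandas as pd  # the same attribute exists on a pyspark.pandas DataFrame

salary_df = pd.DataFrame({
    "Department": ["Finance", "Technology", "Finance", "Technology", "Technology"],
    "Staff": ["Tom", "Peter", "Simon", "Mary", "Jane"],
    "Salary": [90000.0, 57000.0, 40000.0, 34000.0, 12000.0],
})
print(salary_df.dtypes)  # Department and Staff are object, Salary is float64
```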

[Output: the DataFrame column types]

Check Statistical Summary
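The statistical summary comes from `describe()`, which reports the count, mean, standard deviation, min, quartiles and max of the numeric columns; a plain-pandas sketch of the same call:

```python
import pandas as pd  # pyspark.pandas DataFrames offer the same describe() method

salary_df = pd.DataFrame({
    "Department": ["Finance", "Technology", "Finance", "Technology", "Technology"],
    "Staff": ["Tom", "Peter", "Simon", "Mary", "Jane"],
    "Salary": [90000.0, 57000.0, 40000.0, 34000.0, 12000.0],
})
print(salary_df.describe())  # summary statistics for the numeric Salary column
```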

[Output: the statistical summary]

Calculate Sum
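The total below is the column `sum()`; sketched with plain pandas, as the same one-liner works on the `pyspark.pandas` version of `salary_df`:

```python
import pandas as pd  # identical call on a pyspark.pandas DataFrame

salary_df = pd.DataFrame({
    "Department": ["Finance", "Technology", "Finance", "Technology", "Technology"],
    "Staff": ["Tom", "Peter", "Simon", "Mary", "Jane"],
    "Salary": [90000.0, 57000.0, 40000.0, 34000.0, 12000.0],
})
print(salary_df['Salary'].sum())  # 233000.0
```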

233000.0

Calculate Mean


salary_df['Salary'].mean()

46600.0

Calculate Standard Deviation
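The figure below is the sample standard deviation from `std()` (which defaults to `ddof=1` in both libraries); a plain-pandas sketch:

```python
import pandas as pd  # pyspark.pandas uses the same std() signature and default ddof=1

salary_df = pd.DataFrame({
    "Department": ["Finance", "Technology", "Finance", "Technology", "Technology"],
    "Staff": ["Tom", "Peter", "Simon", "Mary", "Jane"],
    "Salary": [90000.0, 57000.0, 40000.0, 34000.0, 12000.0],
})
print(salary_df['Salary'].std())  # sample standard deviation, ~29117.005
```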

29117.005340522228

Calculate Variance of Salary
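Likewise, `var()` returns the sample variance (`ddof=1`), the square of the standard deviation; sketched with plain pandas, which mirrors the `pyspark.pandas` call:

```python
import pandas as pd  # identical var() call on a pyspark.pandas DataFrame

salary_df = pd.DataFrame({
    "Department": ["Finance", "Technology", "Finance", "Technology", "Technology"],
    "Staff": ["Tom", "Peter", "Simon", "Mary", "Jane"],
    "Salary": [90000.0, 57000.0, 40000.0, 34000.0, 12000.0],
})
print(salary_df['Salary'].var())  # 847800000.0
```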

847800000.0

Calculate Skewness of Salary


salary_df['Salary'].skew()

0.44342185901218767

Group Salary by Department


salary_df.groupby('Department')['Salary'].sum()

[Output: total Salary per Department]

Plotting Visualizations in Pandas on Spark

Pandas on Spark uses Plotly as its default backend for visualization.

Let's plot the Salary of each Staff member on a bar graph

salary_df.plot.bar(x='Staff',y='Salary',color='Staff')

[Output: bar graph of Salary per Staff]

Pie Chart of the Salary per Department


salary_df.groupby('Department')['Salary'].sum().plot.pie()

[Output: pie chart of Salary per Department]

Kernel Density Estimation for normally distributed data


ps.DataFrame(np.random.normal(10,2,10000)).plot.kde(bw_method=3)

[Output: KDE plot]

For the complete code, check the notebook here.

Conclusion

Spark enables Data Scientists and Machine Learning practitioners who are skilled in Pandas to accelerate their development and leverage the strength of Spark for processing big data. The Pandas API on Spark lets them work with Spark with hardly any difference from Pandas. In this post we have looked at Pandas-on-Spark capabilities and basic functions in PySpark, and we have seen how to visualize data in Spark with Pandas functions. We have just scratched the surface: there is more to Pandas on Spark, such as advanced data visualizations, streaming and machine learning. To learn more about these topics and concepts, check the official Spark documentation here. In the next post we will look at Spark SQL, another powerful feature of Spark for users who are skilled in SQL and want to leverage big data processing and analytics with Spark. To learn about Spark DataFrames with PySpark, check our previous post here.
