Pandas is a powerful Python data analysis library that makes working with data easier, and it is a standard data analysis tool in Python, with the DataFrame as its core data structure. Spark's own DataFrames are highly optimized for big data workloads, and PySpark gives Pandas users the ability to work with DataFrames in Spark. To fully integrate with Pandas, Spark provides the Pandas API to use with PySpark. Users with Pandas knowledge can now benefit from Spark's speed and its capability to work with larger datasets easily. This is a huge advantage for developers coming from Pandas. In this post we will look at some of the basic Pandas-on-Spark commands. Download the data for this post from **here**.

**Pandas on Spark with PySpark**

**Load required libraries**

```python
import pyspark.pandas as ps
from pyspark.sql import SparkSession
import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')
```

Create Pandas on Spark DataFrame

```python
ps_df = ps.DataFrame(
    [['France', '50M', '3T'], ['India', '30M', '30T'], ['Kenya', '70M', '25T'],
     ['Nigeria', '90M', '60T'], ['China', '20M', '2T'], ['USA', '80M', '30T'],
     ['UK', '70M', '25T'], ['USA', '20M', '30T'], ['China', '70M', '25T'],
     ['France', '50M', '3T'], ['China', '70M', '25T']],
    columns=['Country', 'Population', 'GDP'])
ps_df
```

```python
salary_df = ps.DataFrame(
    {
        "Department": ["Finance", "Technology", "Finance", "Technology", "Technology"],
        "Staff": ["Tom", "Peter", "Simon", "Mary", "Jane"],
        "Salary": [90000.00, 57000.00, 40000.00, 34000.00, 12000.00],
    }
)
salary_df
```

Create Spark DataFrame

```python
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [['France', '50M', '3T'], ['India', '30M', '30T'], ['Kenya', '70M', '25T'],
     ['Nigeria', '90M', '60T'], ['China', '20M', '2T'], ['USA', '80M', '30T'],
     ['UK', '70M', '25T'], ['USA', '20M', '30T'], ['China', '70M', '25T'],
     ['France', '50M', '3T'], ['China', '70M', '25T']],
    schema='Country string,Population string,GDP string')
sdf.show()
```

Create Pandas DataFrame

```python
pd_df = pd.DataFrame(
    [['France', '50M', '3T'], ['India', '30M', '30T'], ['Kenya', '70M', '25T'],
     ['Nigeria', '90M', '60T'], ['China', '20M', '2T'], ['USA', '80M', '30T'],
     ['UK', '70M', '25T'], ['USA', '20M', '30T'], ['China', '70M', '25T'],
     ['France', '50M', '3T'], ['China', '70M', '25T']],
    columns=['Country', 'Population', 'GDP'])
pd_df
```

Read an external CSV file with Pandas-on-Spark

```python
df = ps.read_csv("titanic.csv")
df.head()
```

Convert a Pandas-on-Spark DataFrame to a Pandas DataFrame

```python
ps_to_pd_df = ps_df.to_pandas()
ps_to_pd_df
```

Convert Pandas DataFrame to Pandas-on-Spark DataFrame

```python
pd_to_ps_df = ps.from_pandas(pd_df)
pd_to_ps_df
```

Convert Pandas DataFrame to Spark DataFrame

```python
pd_to_sdf = spark.createDataFrame(pd_df)
pd_to_sdf.show()
```

**Pandas on Spark Functions**

salary_df

Check rows and columns

salary_df.shape

(5, 3)

Check DataFrame types

salary_df.info()

Check Statistical Summary

salary_df.describe()

Calculate Sum

salary_df['Salary'].sum()

233000.0

Calculate Mean

salary_df['Salary'].mean()

46600.0

Calculate Standard Deviation

salary_df['Salary'].std()

29117.005340522228

Calculate Variance of Salary

salary_df['Salary'].var()

847800000.0

Calculate Skewness of Salary

salary_df['Salary'].skew()

0.44342185901218767
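Because the Pandas API on Spark mirrors plain Pandas, the statistics above can be cross-checked in Pandas itself on the same salary figures; a minimal sketch (plain Pandas here, not Spark):

```python
import pandas as pd

# Same salary figures as in salary_df above
salaries = pd.Series([90000.0, 57000.0, 40000.0, 34000.0, 12000.0])

print(salaries.sum())   # 233000.0
print(salaries.mean())  # 46600.0
print(salaries.var())   # 847800000.0 (sample variance, ddof=1 by default)
print(salaries.std())   # square root of the variance, about 29117.01
```

Note that both Pandas and Pandas-on-Spark use the sample (ddof=1) versions of variance and standard deviation by default, which is why the numbers agree.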

Group Salary by Department

`salary_df.groupby('Department')['Salary'].sum()`
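The grouped sums can be verified with the same data in plain Pandas, since the groupby API is identical; Finance totals 130000.0 and Technology 103000.0:

```python
import pandas as pd

# Same data as salary_df above, built as a plain Pandas DataFrame
salary_pd = pd.DataFrame({
    "Department": ["Finance", "Technology", "Finance", "Technology", "Technology"],
    "Staff": ["Tom", "Peter", "Simon", "Mary", "Jane"],
    "Salary": [90000.0, 57000.0, 40000.0, 34000.0, 12000.0],
})

# Sum Salary within each Department
totals = salary_pd.groupby('Department')['Salary'].sum()
print(totals)
# Finance       130000.0
# Technology    103000.0
```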

**Plotting Visualizations in Pandas on Spark**

Pandas on Spark uses Plotly as its default plotting backend for visualization.

salary_df.plot.bar(x='Staff',y='Salary',color='Staff')

Pie Chart of the Salary per Department

salary_df.groupby('Department')['Salary'].sum().plot.pie()

Kernel Density Estimation for a normal distribution data

ps.DataFrame(np.random.normal(10,2,10000)).plot.kde(bw_method=3)
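As a sanity check on the simulated data, 10,000 draws from `np.random.normal(10, 2, 10000)` should have a sample mean near 10 and a sample standard deviation near 2, which is the bell shape the KDE plot visualizes; a quick NumPy sketch (the seed is our own addition, for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded generator, not in the original example
data = rng.normal(10, 2, 10000)  # 10,000 draws: mean 10, standard deviation 2

print(round(data.mean(), 1))  # close to 10
print(round(data.std(), 1))   # close to 2
```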

For complete code check the notebook **here**.

**Conclusion**

To accelerate the development process of Data Scientists and Machine Learning practitioners who are skilled in Pandas, Spark enables them to easily leverage its strength in processing and working with big data. The Pandas API on Spark allows them to work with Spark without feeling a difference from Pandas. In this post we have looked at Pandas-on-Spark capabilities and basic functions in PySpark. We have also seen how to visualize data in Spark with Pandas functions. We have just scratched the surface; there is more to Pandas on Spark, such as advanced data visualizations, streaming, and machine learning. To learn more about these topics and concepts, check the official Spark documentation **here**. In the next post we will look at **SparkSQL**, another powerful feature of Spark for users who are skilled in SQL and want to leverage the strength of big data processing and analytics with Spark. To learn about Spark DataFrames with PySpark, check our previous post **here**.