Structured Query Language (SQL) is the foundational language of relational databases. It is the primary language for organizing and manipulating data in database systems, and the one most familiar to data practitioners. Spark ships with Spark SQL, the module responsible for executing SQL queries; when you run a SQL query from a high-level programming language, the result set comes back as a DataFrame or Dataset. Spark SQL can run complex queries against large datasets in a distributed fashion, at scale. In this post we will write Spark SQL with PySpark. Download the dataset for this post here.


Spark SQL with PySpark

Load required libraries


from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession, the entry point for Spark SQL
spark = SparkSession.builder.getOrCreate()

Load data


# Read the CSV, letting Spark infer each column's type from the data
data = spark.read.csv("titanic.csv", inferSchema=True, header=True)
data.show(5)


Show Data Structure

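The structure Spark inferred while reading the CSV can be checked with printSchema(), which lists each column along with its type and nullability:

# Print column names, inferred types, and nullability
data.printSchema()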

Register a DataFrame as an SQL Temporary View

createOrReplaceTempView registers the DataFrame under a name that SQL queries can reference. The view is scoped to the current SparkSession and disappears when the session ends.


data.createOrReplaceTempView('titanic')

Select 5 records


spark.sql("SELECT * FROM titanic LIMIT 5").show()


Get number of Passengers in Each Class


spark.sql("SELECT pclass AS Passenger_Class, COUNT(*) AS Number_of_Passengers " +
          "FROM titanic GROUP BY pclass").show()
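GROUP BY makes no ordering guarantee; if you want the classes sorted by passenger count, an ORDER BY clause can be appended, a minor variation on the query above:

spark.sql("SELECT pclass AS Passenger_Class, COUNT(*) AS Number_of_Passengers " +
          "FROM titanic GROUP BY pclass ORDER BY Number_of_Passengers DESC").show()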


Spark SQL Mathematical Functions

Get First, Last, Min, Max, Mean and Total Fare of Passengers


spark.sql("SELECT FIRST(Fare) AS First_Record, LAST(Fare) AS Last_Record, MIN(Fare) AS Min_Fare, " +
          "MAX(Fare) AS Max_Fare, AVG(Fare) AS Average_Fare, SUM(Fare) AS Total_Fare FROM titanic").show()


Get Standard Deviation of the Fare for passengers


spark.sql("SELECT stddev(Fare) AS Fare_Standard_Deviation FROM titanic").show()


Get Variance of Fare distribution for the passengers


spark.sql("SELECT variance(Fare) AS Fare_Variance FROM titanic").show()


Get Skewness of the Fare distribution for the passengers


spark.sql("SELECT skewness(Fare) AS Skewness FROM titanic").show()


Get kurtosis of the Fare distribution for the passengers


spark.sql("SELECT kurtosis(Fare) AS Kurtosis FROM titanic").show()


Analytical Functions

Analytical (window) functions compute a value for every row based on a group of related rows, called a window, rather than collapsing the group into a single output row as the aggregate functions above do.


from pyspark.sql.window import Window
from pyspark.sql.functions import row_number

# Work with a smaller projection of the columns we need
df = data.select('pclass', 'name', 'survived', 'fare')
df.show(5)


row_number() Window Function
This function returns a sequential row number, starting from 1, for each row within a window partition.


# Partition rows by passenger class and order each partition by fare
win_function = Window.partitionBy("pclass").orderBy("Fare")
df.withColumn("Row_Number", row_number().over(win_function)).show(truncate=False)
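Window functions are also available in plain SQL through the OVER clause. As a minimal sketch, the same ranking can be run against the titanic view registered earlier:

spark.sql("SELECT pclass, name, fare, " +
          "ROW_NUMBER() OVER (PARTITION BY pclass ORDER BY fare) AS Row_Number " +
          "FROM titanic").show(truncate=False)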


rank() Window Function
This function assigns a rank to each row within a window partition, leaving gaps in the sequence after ties (duplicate values).


from pyspark.sql.functions import rank

df.withColumn("Rank", rank().over(win_function)).show(truncate=False)


dense_rank() Window Function
This function assigns a rank to each row within a window partition without leaving gaps after ties.


from pyspark.sql.functions import dense_rank

df.withColumn("Dense_Rank",dense_rank().over(win_function)).show(truncate=False)


percent_rank() Window Function
The percent_rank function computes the relative rank of each row within its window partition as a fraction between 0 and 1; here, the rank of a passenger's fare within a passenger class (pclass).


from pyspark.sql.functions import percent_rank
df.withColumn("Percent_Rank",percent_rank().over(win_function)).show(truncate=False)


cume_dist() Window Function
This function returns the cumulative distribution of values within a window partition: the fraction of rows with a value less than or equal to the current row's value.


from pyspark.sql.functions import cume_dist

df.withColumn("Cumulative_Dist", cume_dist().over(win_function)).show(truncate=False)


lag() Window Function
The lag function returns the value from a previous row in the partition; here, the previous fare. The first row of each partition has no previous row, so lag returns null there (an optional third argument supplies a default instead).


from pyspark.sql.functions import lag

df.withColumn("Previous_Fare",lag("Fare",1).over(win_function)).show()


lead() Window Function
The lead function returns the value from a following row in the partition; here, the next fare. Like lag, it returns null when the offset runs past the end of the partition.


from pyspark.sql.functions import lead

df.withColumn("Next_Fare",lead("Fare",1).over(win_function)).show(truncate=False)
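A common use of lag and lead is computing row-over-row changes. Here is a minimal sketch reusing df and win_function from above (the Previous_Fare and Fare_Change column names are illustrative):

from pyspark.sql.functions import col, lag

# Fare_Change: each fare minus the previous fare within the same class
df.withColumn("Previous_Fare", lag("Fare", 1).over(win_function)) \
  .withColumn("Fare_Change", col("Fare") - col("Previous_Fare")) \
  .show(truncate=False)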


For the complete code, check the Spark SQL notebook here.

Conclusion

SQL provides powerful functions for manipulating and analysing data in relational databases. To bring that power to big data, Apache Spark provides the Spark SQL engine for executing queries on large workloads. In this post we looked at common Spark SQL functions for analysing data, including mathematical and analytical (window) functions. We have only scratched the surface; to learn more about SQL on Spark with PySpark, check the Apache Spark documentation here. In the next post we will look at feature engineering, an important step in machine learning modelling. To learn about the Pandas API on Spark with PySpark, check our previous post here.
