It’s with no doubt that SQL and Python are the powerful and mostly used languages for working with data. These tools are nowadays used in highbred fashion. Window functions are data manipulation constructs that enables us to operate on a set of rows and return a single aggregated value for each row. Window functions are mostly used for advanced analytical use-cases in SQL. We have different types of Window functions in SQL context such as aggregate, rank and value based window functions. Pandas library has come a long way and is mature enough to offer window functions similar to what we have in SQL. This allows us to perform complex analytical tasks in Pandas with ease. In this post we will look at Window Functions in Pandas and how to use them.
Window Functions in Pandas
Create DataFrame
windows_data_df = pd.DataFrame(
{
"Students": ["Tom", "Peter", "Mary", "Smith"],
"Reg_No": ["1790","1731","1780", "1755"],
"Reg_Date": ["15/01/2021", "13/01/2021", "14/01/2021", "27/01/2021"],
"Math": [79.00, 67.00, 84.00, 70.00],
"Physics": [60, 70, 50, 90],
"Computer": [65.80, 80, 70, 75],
}
)
windows_data_df
Lag in Pandas
The Lag Window function gets the previous value based on a defined key such as timestamp. Note: It’s best practice to sort values first based on a key column such as date.
windows_data_df=windows_data_df.sort_values(by='Reg_Date', ascending=False)
windows_data_df['Previous_Computer'] = windows_data_df['Computer'].shift(1)
windows_data_df
Lead in Pandas
The Lead Window function gets the next value based on a defined key such as timestamp. Note: It’s best practice to sort values first based on a key column such as date.
windows_data_df=windows_data_df.sort_values(by='Reg_Date', ascending=False)
windows_data_df['Next_Computer'] = windows_data_df['Computer'].shift(-1)
windows_data_df
Rolling window
Generic fixed or variable sliding window over the values.
'''Look at how we are calculating the Rolling_Window_Physics column. We're going back and adding previous
number to the current. In this case our window=2'''
windows_data_df['Rolling_Window_Physics']=windows_data_df['Physics'].rolling(window=2).sum()
windows_data_df
Specifying Weights in the Windows
Passing win_type to .rolling generates a generic rolling window computation, that is weighted according the win_type. Refer to documentation for more details.
windows_data_df['Rolling_Window_Physics']=windows_data_df['Physics'].rolling(window=2, win_type='triang').sum()
windows_data_df
Expanding window
Accumulating window over the values. An expanding window yields the value of an aggregation statistic with all the data available up to that point in time.
'''Expanding window with min_periods=1 with sum function resembes cumulative sum.
Trying using other aggregation functions such as mean etc.'''
windows_data_df['Expanding_Window_Physics']=windows_data_df['Physics'].expanding(min_periods=1).mean()
windows_data_df
Exponentially Weighted window
Exponentially Weighted window: Accumulating and exponentially weighted window over the values. An exponentially weighted window is similar to an expanding window but with each prior point being exponentially weighted down relative to the current point. Available EW functions are mean(), var(), std(), corr(), cov()
Let’s first get the data ready
# Let's load our sales data
clv_df=pd.read_csv('clv.csv')
clv_df.head()
Let’s created a sales column by multiplying Quantity by UnitPrice
clv_df['sales']=clv_df['Quantity']*clv_df['UnitPrice']
# Let's order our data in ascending order based on InvoiceDate
clv_df=clv_df.sort_values(by='InvoiceDate',ascending=True)
clv_df.head()
exponentially weighted moving average
Exponentially weighted moving average gives more weight to recent observations, which makes it powerful to capture recent trends more quickly.
### Exponential Moving Average on sales
clv_df['5_day_Sales_EWM'] = clv_df['sales'].ewm(span=5).mean()
clv_df.head(10)
For complete code check the jupyter notebook here.
Conclusion
Mastering SQL and Python as a Data practitioner is a blessing as you can solve almost all the data problems with these two languages. Window functions are powerful techniques for slicing data to present complex analytical insights in simple ways. Pandas offers SQL-lite capabilities of Window functions at our disposal for data manipulation and analysis. In this post we have looked at various common yet powerful Window functions. In the next post we will switch to Data Visualization and its significance in data analytics. To learn about Pandas Statistical Functions check our previous post here.