Pandas gives us powerful, easy-to-use functions for statistical analysis. Statistics is at the centre of any analytical or data science solution: with it we can understand the characteristics of both a sample and the population it is drawn from. Pandas functions make it easy to carry out descriptive and inferential analyses of our data. In this post we will look at Pandas Statistical Functions and how to apply them. This post is an extension of the previous post on Pandas Mathematical Functions. You can download the dataset for this post from here.
Pandas Statistical Functions
Load Data
import pandas as pd
import numpy as np
titanic_df=pd.read_csv('titanic.csv')
titanic_df.head()
Summary statistics
titanic_df.describe(include='all') # To also include categorical summary
titanic_df.describe()
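To see what the two calls return, here is a minimal sketch on a small made-up frame. The name `toy_df` and its values are illustrative assumptions, not rows from the Titanic dataset.

```python
import pandas as pd

# Hypothetical toy data standing in for two Titanic-style columns
toy_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, None],
    "Sex": ["male", "female", "female", "female", "male"],
})

# Default: only numeric columns are summarised (count, mean, std, min, quartiles, max)
numeric_summary = toy_df.describe()

# include='all' adds unique, top and freq rows for the non-numeric columns
full_summary = toy_df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

Note that `count` excludes missing values, so the `Age` count here is 4 rather than 5.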
Show Min Value
titanic_df['Age'].min() # min of age column only
titanic_df.min(numeric_only=True) # min of every numeric column
Show max value
titanic_df['Age'].max() # max of age column only
titanic_df.max(numeric_only=True) # max of every numeric column
Show mode of values
titanic_df['Age'].mode() # mode of Age
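Unlike min() and max(), mode() returns a Series rather than a single number, because several values can be tied for most frequent. A small sketch with made-up ages (the values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical ages where 24 and 30 both appear twice
ages = pd.Series([22, 24, 24, 30, 30, 41])

modes = ages.mode()          # Series holding every tied mode, sorted ascending
print(modes.tolist())

# Take the first entry if a single value is needed
single_mode = modes.iloc[0]
print(single_mode)
```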
Show Median Value
titanic_df['Fare'].median() # Median of Fare
Show sum of values
titanic_df['Fare'].sum() # sum of Fare
Show frequency of each category
titanic_df['Pclass'].value_counts() # frequency of passengers in each class
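If proportions are more useful than raw counts, value_counts() accepts normalize=True. A minimal sketch on made-up class labels (the `pclass` values are assumptions, not the real data):

```python
import pandas as pd

# Hypothetical passenger classes
pclass = pd.Series([3, 1, 3, 3, 2, 1, 3, 2])

counts = pclass.value_counts()                # absolute frequency, most common first
shares = pclass.value_counts(normalize=True)  # relative frequency, sums to 1

print(counts)
print(shares)
```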
Calculate mean
titanic_df['Age'].mean()
Calculate standard deviation
titanic_df['Age'].std() # std for age only
titanic_df.std(numeric_only=True) # std for all numeric columns
Show Variance
titanic_df['Age'].var() # variance for age column only
titanic_df.var(numeric_only=True) # variance for all numeric columns
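One detail worth knowing: std() and var() in Pandas compute the sample statistic (ddof=1) by default, while NumPy's np.var defaults to the population statistic (ddof=0). The sketch below uses made-up numbers to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical values with mean 5 and squared deviations summing to 32
ages = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_var = ages.var()            # divides by n - 1 (ddof=1, the Pandas default)
population_var = ages.var(ddof=0)  # divides by n
sample_std = ages.std()            # square root of the sample variance

print(sample_var, population_var, sample_std)
print(np.var(ages.to_numpy()))     # NumPy default matches ddof=0
```

Pass ddof=0 when the data covers the whole population rather than a sample of it.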
Show Covariance
titanic_df[['Age','Fare']].cov() # Covariance of Age and Fare
titanic_df.cov(numeric_only=True) # Covariance for all numeric columns
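cov() returns a symmetric matrix: the diagonal holds each column's variance and the off-diagonal entries hold the pairwise covariances. A minimal sketch on a made-up frame (`toy_df` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical Age/Fare pairs
toy_df = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

cov_matrix = toy_df.cov()

# Diagonal entry equals the column's sample variance
age_variance = toy_df["Age"].var()

# Off-diagonal entry equals the pairwise covariance from Series.cov
age_fare_cov = toy_df["Age"].cov(toy_df["Fare"])

print(cov_matrix)
```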
Correlation
Correlation measures the relationship between two variables. Pandas supports three methods.
1. Pearson Correlation
Measures the linear relationship between two variables. The Pearson correlation coefficient is the default correlation method for a Pandas DataFrame.
NOTE: Pearson correlation is sensitive to outliers, and significance tests on it assume that the data is normally distributed.
titanic_df.corr(method='pearson', numeric_only=True)
titanic_df.corr(numeric_only=True) # Or don't specify the method since Pearson is the default
2. Spearman Rank Correlation
Measures the monotonic relationship between two variables. It does not assume that the data is normally distributed. Computing it takes O(n log n) time.
titanic_df.corr(method='spearman', numeric_only=True)
3. Kendall Rank Correlation
Measures the monotonic relationship between two variables. It does not assume that the data is normally distributed. Computing it takes O(n^2) time, so it tends to be slower on large datasets.
titanic_df.corr(method='kendall', numeric_only=True)
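The difference between a linear and a monotonic measure is easiest to see on data that always increases but not in a straight line. The sketch below uses a made-up frame where y is the cube of x (both columns are assumptions for illustration) and compares Pearson with Spearman only, to keep it short.

```python
import pandas as pd

# Hypothetical data: y always increases with x, but not linearly
toy_df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
toy_df["y"] = toy_df["x"] ** 3

# Pearson looks for a straight-line relationship, so it falls short of 1
pearson_r = toy_df.corr(method="pearson").loc["x", "y"]

# Spearman works on ranks, so any strictly increasing relationship scores 1
spearman_r = toy_df.corr(method="spearman").loc["x", "y"]

print(pearson_r, spearman_r)
```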
Calculate Kurtosis
titanic_df.kurtosis(numeric_only=True)
Calculate Skew
titanic_df.skew(numeric_only=True)
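As a rough guide to reading these numbers: skew is close to 0 for symmetric data and positive when there is a long right tail. Pandas reports excess kurtosis, so a normal distribution scores about 0 and heavier tails score above 0. A small sketch on made-up series (the values are assumptions):

```python
import pandas as pd

# Hypothetical symmetric data and data with one large value in the right tail
symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_skewed = pd.Series([1.0, 1.0, 2.0, 2.0, 3.0, 20.0])

print(symmetric.skew())         # essentially zero: no tail on either side
print(right_skewed.skew())      # positive: long right tail

print(symmetric.kurtosis())     # negative: flatter than a normal distribution
print(right_skewed.kurtosis())  # positive: the outlier makes the tail heavy
```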
Compute Percent change
Calculates the percent change over a given number of periods. Handle missing values (nulls) before computing the percent change.
titanic_df['Fare'].pct_change(periods=3)
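pct_change() compares each row with the row `periods` positions before it, so the result depends on row order and the first `periods` rows come back as NaN. A minimal sketch on a made-up series that grows by 10% per step (the values are assumptions):

```python
import pandas as pd

# Hypothetical values, each 10% larger than the previous one
fares = pd.Series([100.0, 110.0, 121.0, 133.1])

# Change relative to the previous row: NaN, then about 0.10 each step
step_change = fares.pct_change()

# Change relative to two rows earlier: NaN, NaN, then about 0.21
two_step_change = fares.pct_change(periods=2)

print(step_change)
print(two_step_change)
```

The result is a fraction, so multiply by 100 to express it as a percentage.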
Rank
Ranks the values in each column. By default, tied values share the average of the ranks they span.
titanic_df.rank().head()
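How ties are handled is controlled by the method parameter. The sketch below uses a made-up series with one tie (the values are assumptions) to compare three of the available options.

```python
import pandas as pd

# Hypothetical fares with a tie between the second and third values
fares = pd.Series([7.25, 8.05, 8.05, 71.28])

average_rank = fares.rank()             # default: ties get the mean of ranks 2 and 3
min_rank = fares.rank(method="min")     # ties get the lowest rank; the next rank is skipped
dense_rank = fares.rank(method="dense") # ties get the lowest rank; no ranks are skipped

print(average_rank.tolist())  # [1.0, 2.5, 2.5, 4.0]
print(min_rank.tolist())      # [1.0, 2.0, 2.0, 4.0]
print(dense_rank.tolist())    # [1.0, 2.0, 2.0, 3.0]
```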
For the complete code, check the Jupyter notebook here.
Conclusion
In this post we have looked at various commonly used Pandas Statistical Functions. Statistics is important for analysing data and interpreting the results. Pandas provides us with tons of statistical functions to work with data. In the next post we will look at Window Functions in Pandas and how to apply them. To learn about Pandas Mathematical Functions, check our previous post here.