Pandas gives us powerful, easy-to-use functions for statistical analysis. Statistics is at the centre of any analytics or data science solution: with it we can understand the characteristics of both a sample and the population it is drawn from. Pandas functions make it easy to run descriptive analyses of our data, and they provide the building blocks for inferential ones. In this post we will look at Pandas Statistical Functions and how to apply them. This post is an extension of the previous post on Pandas Mathematical Functions. You can download the dataset for this post from here.


Pandas Statistical Functions

Load Data

                    

import pandas as pd
import numpy as np

titanic_df = pd.read_csv('titanic.csv') # assumes titanic.csv is in the working directory
titanic_df.head() # preview the first five rows

Summary statistics

                    

titanic_df.describe() # summary of the numeric columns only
titanic_df.describe(include='all') # also include a summary of the categorical columns

Show Min Value

                    

titanic_df['Age'].min() # min of Age column only
titanic_df.min(numeric_only=True) # min of every numeric column; text columns with missing values can raise a TypeError

Show max value

                    

titanic_df['Age'].max() # max of Age column only
titanic_df.max(numeric_only=True) # max of every numeric column; text columns with missing values can raise a TypeError

Show mode of values

                    

titanic_df['Age'].mode() # mode of Age (returns a Series, since there can be more than one mode)

Show Median Value

                    

titanic_df['Fare'].median() # Median of Fare


Show sum of values

                    

titanic_df['Fare'].sum() # sum of Fare

Show frequency of each category

                    

titanic_df['Pclass'].value_counts() # frequency of passengers in each class

Calculate mean
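The code for this step did not survive in the text, so here is a minimal sketch that follows the same pattern as the other sections. The frame and its values are made up for illustration (they are not Titanic data), and `sample_df` is a hypothetical name standing in for `titanic_df`.

```python
import numpy as np
import pandas as pd

# Small made-up sample standing in for titanic_df (illustrative values only)
sample_df = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.25, 8.0, 53.5],
})

age_mean = sample_df["Age"].mean()             # mean of one column; NaN is skipped by default
all_means = sample_df.mean(numeric_only=True)  # mean of every numeric column

print(age_mean)
print(all_means)
```

Passing `numeric_only=True` keeps text columns such as the hypothetical `Name` out of the calculation.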


Calculate standard deviation

                    

titanic_df['Age'].std() # std for Age only
titanic_df.std(numeric_only=True) # std for all numeric columns; recent pandas versions raise an error on text columns otherwise

Show Variance

                    

titanic_df['Age'].var() # variance for Age column only
titanic_df.var(numeric_only=True) # variance for all numeric columns; recent pandas versions raise an error on text columns otherwise
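Both `std()` and `var()` use the sample formula by default (`ddof=1`, dividing by n - 1), and the standard deviation is the square root of the variance. A small sketch on made-up ages (not Titanic data) shows how the two relate and how `ddof=0` switches to the population formula.

```python
import pandas as pd

ages = pd.Series([20.0, 30.0, 40.0, 50.0])  # made-up values for illustration

sample_var = ages.var()            # divides by n - 1 (ddof=1 is the default)
sample_std = ages.std()            # square root of the sample variance
population_var = ages.var(ddof=0)  # divides by n instead

print(sample_var, sample_std, population_var)
```

Which `ddof` is appropriate depends on whether the data is treated as a sample or as the whole population.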


Show Covariance

                    

titanic_df[['Age', 'Fare']].cov() # Covariance of Age and Fare
titanic_df.cov(numeric_only=True) # Covariance for all numeric columns in the dataframe
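Covariance depends on the units of both columns, which makes it hard to compare across pairs. Dividing it by the product of the two standard deviations gives the Pearson correlation coefficient. The sketch below checks that relationship on made-up values (not Titanic data).

```python
import pandas as pd

# Made-up values for illustration only
df = pd.DataFrame({
    "Age": [20.0, 30.0, 40.0, 50.0],
    "Fare": [10.0, 15.0, 40.0, 35.0],
})

cov_age_fare = df["Age"].cov(df["Fare"])  # sample covariance of the two columns
pearson_from_cov = cov_age_fare / (df["Age"].std() * df["Fare"].std())
pearson_direct = df["Age"].corr(df["Fare"])  # Pearson is the default method

print(cov_age_fare, pearson_from_cov, pearson_direct)
```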


Correlation

Correlation measures the strength and direction of the relationship between two variables.

1. Pearson Correlation

Measures the linear relationship between two variables. The Pearson correlation coefficient is the default correlation method for a Pandas DataFrame.

NOTE: Significance tests on the Pearson coefficient assume that the data is approximately normally distributed. The coefficient is also sensitive to outliers.

                    

titanic_df.corr(method='pearson', numeric_only=True) # numeric_only skips text columns
titanic_df.corr(numeric_only=True) # Or don't specify the method since Pearson is the default

2. Spearman Rank Correlation

Measures the monotonic relationship between two variables. It does not assume that the data is normally distributed. It has a time complexity of O(n log n).

                    

titanic_df.corr(method='spearman', numeric_only=True) # numeric_only skips text columns
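To see the difference between a linear and a monotonic relationship, the sketch below uses made-up data where `y` always increases with `x` but not along a straight line. Spearman works on ranks, so it reports a perfect relationship, while Pearson comes out below 1.

```python
import pandas as pd

# Made-up data: y grows with x, but as a cube rather than a straight line
x = list(range(1, 11))
df = pd.DataFrame({"x": x, "y": [v ** 3 for v in x]})

pearson_r = df.corr(method="pearson").loc["x", "y"]
spearman_r = df.corr(method="spearman").loc["x", "y"]

print(pearson_r)   # below 1: the relationship is not linear
print(spearman_r)  # 1.0 up to floating-point error: the relationship is perfectly monotonic
```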


3. Kendall Rank Correlation

It measures the monotonic relationship between two variables. It does not assume that the data is normally distributed. It has a time complexity of O(n^2), hence it tends to be slower on large datasets.

                    

titanic_df.corr(method='kendall', numeric_only=True) # numeric_only skips text columns

Calculate Kurtosis
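The code for this step did not survive in the text. A minimal sketch, assuming the same Series pattern used elsewhere in this post, is shown on made-up values (not Titanic data). `kurt()` reports excess kurtosis using Fisher's definition, so a normal distribution scores about 0, flat data scores below 0 and heavy-tailed data scores above 0.

```python
import pandas as pd

flat = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])          # evenly spread values
heavy_tail = pd.Series([8.0, 8.0, 8.0, 8.0, 500.0])  # one extreme value

flat_kurt = flat.kurt()              # negative: flatter than a normal distribution
heavy_tail_kurt = heavy_tail.kurt()  # positive: the outlier produces a heavy tail

print(flat_kurt, heavy_tail_kurt)
```

For a whole frame, `titanic_df.kurt(numeric_only=True)` would follow the same pattern as the other column-wise calls in this post.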


Calculate Skew
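The code for this step is also missing from the text. A minimal sketch, again on made-up values rather than Titanic data: `skew()` returns roughly 0 for symmetric data and a positive value when the long tail is on the right.

```python
import pandas as pd

symmetric = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
right_tailed = pd.Series([1.0, 1.0, 1.0, 1.0, 10.0])  # long tail towards large values

symmetric_skew = symmetric.skew()        # about 0: no tail on either side
right_tailed_skew = right_tailed.skew()  # positive: tail on the right

print(symmetric_skew, right_tailed_skew)
```

For a whole frame, `titanic_df.skew(numeric_only=True)` would follow the same pattern as the other column-wise calls in this post.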


Compute Percent change

Calculates the percent change over a given number of periods. Handle missing values (nulls) before computing the percent change.

                    

titanic_df['Fare'].pct_change(periods=3)
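As a small worked example of handling nulls first, the sketch below forward-fills a made-up series before calling `pct_change`. Forward fill is only one possible choice here; dropping or interpolating the missing rows may suit other data better.

```python
import numpy as np
import pandas as pd

fares = pd.Series([10.0, np.nan, 20.0, 30.0, 60.0])  # made-up values with one null

filled = fares.ffill()                 # replace the null with the previous value
change = filled.pct_change(periods=3)  # compare each value with the one 3 rows earlier

print(change)
```

The first three results are NaN because there is no value three rows earlier to compare against. The fourth is (30 - 10) / 10 = 2.0 and the fifth is (60 - 10) / 10 = 5.0.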


Rank

Ranks the data values. Tied values share a rank, which by default is the average of the ranks they would otherwise occupy.
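The code for this step did not survive in the text. A minimal sketch on made-up fares (not Titanic data) shows how `rank()` treats ties under three of its `method` options.

```python
import pandas as pd

fares = pd.Series([7.25, 71.25, 7.25, 8.0])  # made-up values; 7.25 appears twice

average_rank = fares.rank()              # default method='average': ties share the mean rank
min_rank = fares.rank(method="min")      # ties all get the lowest rank in their group
dense_rank = fares.rank(method="dense")  # like 'min', but ranks rise by 1 between groups

print(average_rank.tolist())  # [1.5, 4.0, 1.5, 3.0]
print(min_rank.tolist())      # [1.0, 4.0, 1.0, 3.0]
print(dense_rank.tolist())    # [1.0, 3.0, 1.0, 2.0]
```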


For the complete code, check the Jupyter notebook here.

Conclusion

In this post we have looked at commonly used Pandas Statistical Functions. Statistics is important for analysing data and for interpreting results and insights, and Pandas provides a wide range of statistical functions to work with. In the next post we will look at Window Functions in Pandas and how to apply them. To learn about Pandas Mathematical Functions, check our previous post here.
