This post extends our earlier posts on histograms in matplotlib and distribution plots in seaborn. Before deciding which statistical tests and methods to apply to a dataset, we must first understand the data. Distribution plots help us see whether the data is skewed, along with other characteristics such as central tendency and outliers, which in turn inform the tests and inferences we make. There are several ways to check the distribution of data, including histograms and density plots. In this post we look at both and at how to create them in plotly. Download the data for this post here.
Histogram
A histogram is used to visualize the frequency distribution of a random variable. The values are grouped into bins (classes) and drawn as a series of adjacent rectangular bars, where the height of each bar denotes the number of observations that fall in that bin. Histograms are useful for describing the general characteristics of a population or sample, and they form the basis for later analytical decisions, methods and statistical inferences.
When to Use Histogram
Use a histogram to present the frequency distribution of a single variable, with the x-axis representing the bins/classes and the y-axis showing how many observations fall in each bin.
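To make the bin/frequency idea concrete, here is a minimal sketch that counts values per bin with NumPy before any plotting is involved. The sample data, the seed and the choice of 10 bins are assumptions made only for illustration.

```python
import numpy as np

# Fixed seed and sample size are illustrative assumptions, not from the post.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=10, scale=2, size=1000)

# Split the observed range into 10 equal-width bins and count values per bin.
# counts[i] is the height of bar i; bin_edges holds the 11 bin boundaries.
counts, bin_edges = np.histogram(values, bins=10)

for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count}")
```

Every observation lands in exactly one bin, so the bar heights add up to the sample size.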
How to Use Histogram
- Select a number of bins that clearly depicts the distribution of the variable.
- The variable measured should be a continuous numerical variable.
- Use bins of uniform size.
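One way to pick a starting number of bins is to let a rule of thumb suggest it. The sketch below uses NumPy's `np.histogram_bin_edges` with a few of its built-in rules; the sample data and seed are assumptions for illustration, and the suggested counts are a starting point rather than a definitive answer.

```python
import numpy as np

# Illustrative data: a skewed sample similar to the exponential example in this post.
rng = np.random.default_rng(seed=1)
values = rng.exponential(scale=0.5, size=1000)

bin_counts = {}
bin_widths = {}
for rule in ("sturges", "fd", "auto"):
    # Each rule returns equal-width bin edges spanning the data range.
    edges = np.histogram_bin_edges(values, bins=rule)
    bin_counts[rule] = len(edges) - 1
    bin_widths[rule] = np.diff(edges)
    print(f"{rule:8s} bins={bin_counts[rule]:3d} width={bin_widths[rule][0]:.3f}")
```

A suggested count can then be tried as the `nbins` argument of `px.histogram`. Plotly appears to treat `nbins` as a target rather than an exact count, so the plotted bins may differ slightly.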
Histogram in Plotly
Import Required Libraries
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
Histograms
Simple Normal Distribution Histogram
# Draw 10,000 samples from a normal distribution with mean 10 and standard deviation 2
normal_distributed_variable=np.random.normal(10,2,10000)
fig = px.histogram(normal_distributed_variable)
# Center the title, place the legend inside the plot area and set the axis titles and font
fig.update_layout(title={'text': 'Normal Distribution Histogram','y':0.95,'x':0.5, 'xanchor': 'center','yanchor': 'top'},
legend=dict(yanchor="top",y=0.95,xanchor="right",x=0.95),
autosize=True,margin=dict(t=70,b=0,l=0,r=0), xaxis_title='Number', yaxis_title='Count',
font=dict(size=20, family='Times New Roman', color='brown') )
fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.show()
Exponential Distribution in Plotly
# Draw 1,000 samples from an exponential distribution as a flat 1-D array
exponential_distributed_variable = np.random.exponential(scale=0.5, size=1000)
fig = px.histogram(exponential_distributed_variable)
fig.update_layout(title={'text': 'Exponential Distribution Histogram','y':0.95,'x':0.5, 'xanchor': 'center','yanchor': 'top'},
legend=dict(yanchor="top",y=0.95,xanchor="right",x=0.95),
autosize=True,margin=dict(t=70,b=0,l=0,r=0), xaxis_title='Number', yaxis_title='Count',
font=dict(size=20, family='Times New Roman', color='brown') )
fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.show()
Multi-variate Histogram
# Load the iris data; the code below uses its 'sepal_length' and 'class' columns
iris_df=pd.read_csv('iris.csv')
iris_df.head()
# One histogram of sepal length, colored by species
fig = px.histogram(iris_df, x='sepal_length', color='class',nbins=80)
fig.update_layout(title={'text': 'Iris Species Distribution Histogram','y':0.95,'x':0.5, 'xanchor': 'center','yanchor': 'top'},
legend=dict(yanchor="top",y=0.95,xanchor="right",x=0.95),
autosize=True,margin=dict(t=70,b=0,l=0,r=0), xaxis_title='Sepal Length', yaxis_title='Count',
font=dict(size=20, family='Times New Roman', color='brown') )
fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.show()
Multi-variate Histogram with Overlay
fig=go.Figure()
fig.add_trace(go.Histogram(x=iris_df[iris_df['class']=='Iris-setosa']['sepal_length'], name='Iris Setosa'))
fig.add_trace(go.Histogram(x=iris_df[iris_df['class']=='Iris-versicolor']['sepal_length'], name='Iris Versicolor'))
fig.add_trace(go.Histogram(x=iris_df[iris_df['class']=='Iris-virginica']['sepal_length'],name='Iris Virginica'))
fig.update_traces(opacity=0.55)
fig.update_layout(title={'text': 'Sepal Length Distribution Histogram','y':0.95,'x':0.5, 'xanchor': 'center','yanchor': 'top'},
legend=dict(title="Iris Species", yanchor="top",y=0.98,xanchor="right",x=0.95),
autosize=True,margin=dict(t=70,b=0,l=0,r=0), xaxis_title='Sepal Length', yaxis_title='Frequency',
font=dict(size=20, family='Times New Roman', color='brown'),
barmode='overlay')
fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.show()
Distribution Plots with Kernel Density Estimation
Kernel Density Estimation (KDE) extends the idea of the histogram by smoothing the distribution with a kernel function. According to Wikipedia, kernel density estimation is a non-parametric way to estimate the probability density function of a random variable. It is a fundamental data smoothing problem in which inferences about the population are made based on a finite data sample.
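To show what "smoothing with a kernel function" means in practice, here is a minimal from-scratch sketch of a Gaussian KDE in NumPy. The helper name `gaussian_kde_estimate`, the sample data and the bandwidth choice (the normal reference rule of thumb, 1.06·σ·n^(-1/5)) are assumptions for illustration. This is not necessarily how plotly computes its curves.

```python
import numpy as np

def gaussian_kde_estimate(samples, grid, bandwidth):
    """Hypothetical helper: average one Gaussian bump per sample, evaluated on grid."""
    # Standardized distance from every grid point to every sample,
    # shape (len(grid), len(samples)).
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Mean over samples, rescaled by the bandwidth so the curve integrates to ~1.
    return kernel.mean(axis=1) / bandwidth

# Illustrative data, not taken from the post.
rng = np.random.default_rng(seed=2)
samples = rng.normal(loc=10, scale=2, size=500)

# Normal reference rule of thumb for the bandwidth (an assumed choice).
bandwidth = 1.06 * samples.std(ddof=1) * samples.size ** (-1 / 5)

# Evaluate a little beyond the data range so the tails are included.
grid = np.linspace(samples.min() - 3 * bandwidth, samples.max() + 3 * bandwidth, 400)
density = gaussian_kde_estimate(samples, grid, bandwidth)

# Trapezoid-rule area under the estimated curve; should be close to 1.
area = float(np.sum((density[:-1] + density[1:]) / 2 * np.diff(grid)))
print(f"bandwidth={bandwidth:.3f}, area under curve={area:.4f}")
```

A smaller bandwidth follows the sample more closely and looks bumpier, while a larger one gives a smoother but flatter curve.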
Univariate Normal Distribution Plot
import plotly.figure_factory as ff
normal_distributed_variable=np.random.normal(10,2,10000)
# create_distplot expects a list of 1-D arrays and a matching list of labels
fig = ff.create_distplot([normal_distributed_variable],group_labels=['Random Numbers'])
# The distplot histogram is normalized to a probability density, so the y-axis shows density rather than counts
fig.update_layout(title={'text': 'Univariate Normal Distribution','y':0.95,'x':0.5, 'xanchor': 'center','yanchor': 'top'},
legend=dict(yanchor="top",y=0.95,xanchor="right",x=0.95),
autosize=True,margin=dict(t=70,b=0,l=0,r=0), xaxis_title='Number', yaxis_title='Density',
font=dict(size=20, family='Times New Roman', color='brown'),
width=900, height=800)
fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.show()
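`ff.create_distplot` draws its histogram with `histnorm='probability density'` by default, so the bar heights are densities rather than raw counts. The sketch below reproduces that normalization with NumPy to show how the two relate. The sample data, seed and the choice of 40 bins are assumptions for illustration.

```python
import numpy as np

# Illustrative data, not taken from the post.
rng = np.random.default_rng(seed=3)
values = rng.normal(loc=10, scale=2, size=10_000)

# Raw counts per bin, and the same bins normalized to a probability density.
counts, edges = np.histogram(values, bins=40)
density, _ = np.histogram(values, bins=40, density=True)

widths = np.diff(edges)
# density = count / (n * bin width), so the bar areas add up to 1.
manual_density = counts / (counts.sum() * widths)
total_area = float(np.sum(density * widths))
print(f"total bar area = {total_area:.6f}")
```

Because the bars and the KDE curve are both on a density scale, they can share one y-axis.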
Univariate Exponential Distribution Plot
# create_distplot needs 1-D data, so sample a flat array of 1,000 values directly
exponential_distributed_variable = np.random.exponential(scale=0.5, size=1000)
fig = ff.create_distplot([exponential_distributed_variable], group_labels=['Random Numbers'])
fig.update_layout(title={'text': 'Univariate Exponential Distribution','y':0.95,'x':0.5, 'xanchor': 'center','yanchor': 'top'},
legend=dict(yanchor="top",y=0.95,xanchor="right",x=0.95),
autosize=True,margin=dict(t=70,b=0,l=0,r=0), xaxis_title='Number', yaxis_title='Density',
font=dict(size=20, family='Times New Roman', color='brown'),
width=900, height=800)
fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.show()
Multi-Variate Distribution Plot
# One series of sepal lengths per species
dist_data=[iris_df[iris_df['class']=='Iris-setosa']['sepal_length'], iris_df[iris_df['class']=='Iris-versicolor']['sepal_length'],
iris_df[iris_df['class']=='Iris-virginica']['sepal_length']]
group_labels=['Iris Setosa','Iris Versicolor','Iris Virginica']
fig = ff.create_distplot(dist_data,group_labels)
fig.update_traces(opacity=0.55)
fig.update_layout(title={'text': 'Sepal Length Distribution Plot','y':0.95,'x':0.5, 'xanchor': 'center','yanchor': 'top'},
legend=dict(title="Iris Species", yanchor="top",y=0.98,xanchor="right",x=0.95),
autosize=True,margin=dict(t=70,b=0,l=0,r=0), xaxis_title='Sepal Length', yaxis_title='Density',
font=dict(size=20, family='Times New Roman', color='brown'),
width=900, height=800
)
fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.show()
Multi-Variate Distribution Plot without Histogram
# Same data as above, but show_hist=False keeps only the KDE curves and the rug plot
dist_data=[iris_df[iris_df['class']=='Iris-setosa']['sepal_length'], iris_df[iris_df['class']=='Iris-versicolor']['sepal_length'],
iris_df[iris_df['class']=='Iris-virginica']['sepal_length']]
group_labels=['Iris Setosa','Iris Versicolor','Iris Virginica']
fig = ff.create_distplot(dist_data,group_labels,show_hist=False)
fig.update_traces(opacity=0.55)
fig.update_layout(title={'text': 'Sepal Length Distribution Plot','y':0.95,'x':0.5, 'xanchor': 'center','yanchor': 'top'},
legend=dict(title="Iris Species", yanchor="top",y=0.98,xanchor="right",x=0.95),
autosize=True,margin=dict(t=70,b=0,l=0,r=0), xaxis_title='Sepal Length', yaxis_title='Density',
font=dict(size=20, family='Times New Roman', color='brown'),
width=900, height=800
)
fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.show()
For the complete code, check the Jupyter notebook here.
Conclusion
In this post we have looked at histograms and Kernel Density Estimation (KDE), and at how to create distribution plots in plotly. Distribution plots are useful for showing the frequency distribution of a continuous numerical variable. They tell us about the characteristics of the data and help us decide which statistical methods and tests to carry out. In the next post we will look at heat maps in plotly. To learn about boxplots in plotly, check our previous post here. To learn about distribution plots in seaborn, check our post here.