This post extends our earlier posts on histograms in matplotlib and distribution plots in seaborn. Before deciding which statistical tests and methods to apply to a dataset, we must first understand the data. Distribution plots help us see whether the data is skewed and reveal other characteristics such as central tendency and outliers, which in turn inform the tests and inferences we make. There are several ways to check the distribution of data; in this post we look at two of them, histograms and density plots, and at how to create them in plotly. Download the data for this post here.


Histogram

A histogram is used to visualize the frequency distribution of a variable. The values are grouped into bins (classes) and drawn as a series of adjacent rectangular bars, where the height of each bar denotes the number of observations in that bin. Histograms are useful for describing the general characteristics of a population or sample, and they form a basis for later analytical decisions, methods and statistical inferences.

When to Use Histogram

A histogram is used to present the frequency distribution of a variable, with the x-axis representing the bins/classes and the y-axis showing how many observations fall in each bin.
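The binning described above can be sketched numerically before any plotting. The snippet below is only an illustration on simulated data (the seed, sample size and the choice of 10 bins are arbitrary assumptions); it uses NumPy's np.histogram to return the count in each bin together with the bin edges that would sit on the x-axis.

```python
import numpy as np

# Simulated data; the seed is fixed only so the output is reproducible.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=10, scale=2, size=1000)

# Split the range of the data into 10 equal-width bins and count the values in each.
counts, bin_edges = np.histogram(values, bins=10)

# Each bar of the histogram corresponds to one (left edge, right edge, count) triple.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"[{left:6.2f}, {right:6.2f}): {count}")
```

Every observation lands in exactly one bin, so the counts add up to the sample size.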

How to Use Histogram

  1. Select a number of bins that clearly depicts the distribution of the variable.
  2. The variable measured should be a continuous numerical variable.
  3. Use bins of uniform size.
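One way to pick a starting number of bins is to use a standard rule of thumb. The sketch below is not part of the original workflow; it assumes simulated data and uses NumPy's np.histogram_bin_edges with three of its built-in rules ("sturges", "scott" and "fd", the Freedman-Diaconis rule) to suggest uniform bins.

```python
import numpy as np

# Simulated data; the seed is fixed only so the output is reproducible.
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=10, scale=2, size=10_000)

suggested_bins = {}
for rule in ("sturges", "scott", "fd"):
    # Each rule returns equal-width bin edges spanning the range of the data.
    edges = np.histogram_bin_edges(values, bins=rule)
    suggested_bins[rule] = len(edges) - 1
    bin_width = edges[1] - edges[0]
    print(f"{rule:8s} -> {suggested_bins[rule]} bins of width {bin_width:.3f}")
```

A suggested count can then be passed to px.histogram through its nbins argument and adjusted by eye if the plot looks too coarse or too noisy.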

Histogram in Plotly

Import Required Libraries

                    

import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

Histograms

Simple Normal Distribution Histogram

                    

# Draw 10,000 values from a normal distribution with mean 10 and standard deviation 2
normal_distributed_variable = np.random.normal(10, 2, 10000)

fig = px.histogram(normal_distributed_variable)

# Centre the title, place the legend inside the plot area and label the axes
fig.update_layout(title={'text': 'Normal Distribution Histogram', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
                  legend=dict(yanchor="top", y=0.95, xanchor="right", x=0.95),
                  autosize=True, margin=dict(t=70, b=0, l=0, r=0), xaxis_title='Number', yaxis_title='Count',
                  font=dict(size=20, family='Times New Roman', color='brown'))

fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)

fig.show()

plotly-simple-univariate-normal-distribution-histogram

Exponential Distribution in Plotly

                    

# Draw 1,000 values from an exponential distribution with scale (mean) 0.5
exponential_distributed_variable = np.random.exponential(scale=0.5, size=1000)

fig = px.histogram(exponential_distributed_variable)

# Centre the title, place the legend inside the plot area and label the axes
fig.update_layout(title={'text': 'Exponential Distribution Histogram', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
                  legend=dict(yanchor="top", y=0.95, xanchor="right", x=0.95),
                  autosize=True, margin=dict(t=70, b=0, l=0, r=0), xaxis_title='Number', yaxis_title='Count',
                  font=dict(size=20, family='Times New Roman', color='brown'))

fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)

fig.show()

plotly-simple-univariate-exponential-distribution-histogram
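The difference in shape between the two histograms above (roughly symmetric for the normal sample, a long right tail for the exponential sample) can also be checked numerically. The following is a minimal sketch with a hypothetical helper, sample_skewness, that computes the moment-based skewness coefficient with NumPy; the seed and sample sizes are assumptions made for the illustration.

```python
import numpy as np

def sample_skewness(values):
    """Moment-based skewness: third central moment over the cubed standard deviation."""
    values = np.asarray(values, dtype=float).ravel()
    centered = values - values.mean()
    return np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5

# Simulated data; the seed is fixed only so the output is reproducible.
rng = np.random.default_rng(seed=0)
normal_sample = rng.normal(10, 2, 10_000)
exponential_sample = rng.exponential(scale=0.5, size=10_000)

normal_skew = sample_skewness(normal_sample)
exponential_skew = sample_skewness(exponential_sample)

print(f"normal sample skewness:      {normal_skew:.3f}")
print(f"exponential sample skewness: {exponential_skew:.3f}")
```

A value near zero indicates a roughly symmetric distribution, while a clearly positive value indicates a right-skewed one, which matches what the two histograms show.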

Multi-variate Histogram

                    

iris_df=pd.read_csv('iris.csv')
iris_df.head()

seaborn-scatter-plot-load-iris-data

                    

# One histogram per species, coloured by the 'class' column
fig = px.histogram(iris_df, x='sepal_length', color='class', nbins=80)

fig.update_layout(title={'text': 'Iris Species Distribution Histogram', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
                  legend=dict(yanchor="top", y=0.95, xanchor="right", x=0.95),
                  autosize=True, margin=dict(t=70, b=0, l=0, r=0), xaxis_title='Sepal Length', yaxis_title='Count',
                  font=dict(size=20, family='Times New Roman', color='brown'))

fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)

fig.show()

plotly-multivariate-histogram

Multi-variate Histogram with Overlay

                    

fig = go.Figure()

# Add one histogram trace per species
fig.add_trace(go.Histogram(x=iris_df[iris_df['class'] == 'Iris-setosa']['sepal_length'], name='Iris Setosa'))
fig.add_trace(go.Histogram(x=iris_df[iris_df['class'] == 'Iris-versicolor']['sepal_length'], name='Iris Versicolor'))
fig.add_trace(go.Histogram(x=iris_df[iris_df['class'] == 'Iris-virginica']['sepal_length'], name='Iris Virginica'))

# Make the bars semi-transparent so overlapping histograms remain visible
fig.update_traces(opacity=0.55)
fig.update_layout(title={'text': 'Sepal Length Distribution Histogram', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
                  legend=dict(title="Iris Species", yanchor="top", y=0.98, xanchor="right", x=0.95),
                  autosize=True, margin=dict(t=70, b=0, l=0, r=0), xaxis_title='Sepal Length', yaxis_title='Frequency',
                  font=dict(size=20, family='Times New Roman', color='brown'),
                  barmode='overlay')  # draw the histograms on top of each other instead of stacking them

fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)

fig.show()

plotly-multivariate-histogram-with-overlay

Distribution Plots with Kernel Density Estimation

Kernel Density Estimation (KDE) builds on the idea of the histogram but smooths the graph using a kernel function. According to Wikipedia, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made based on a finite data sample.
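To make the smoothing idea concrete, here is a minimal sketch of a Gaussian KDE written directly in NumPy. It is an illustration only, not how plotly computes its curves: the helper gaussian_kde_estimate is hypothetical, the data is simulated, and the bandwidth comes from a common normal-reference rule of thumb (1.06 times the sample standard deviation times n to the power -1/5), which is an assumption made for this example.

```python
import numpy as np

def gaussian_kde_estimate(sample, grid, bandwidth):
    """Average a Gaussian bump centred on every observation, evaluated on `grid`."""
    sample = np.asarray(sample, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Standardised distance from every grid point to every observation.
    z = (grid[:, None] - sample[None, :]) / bandwidth
    kernel_values = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernel_values.sum(axis=1) / (len(sample) * bandwidth)

# Simulated data; the seed is fixed only so the output is reproducible.
rng = np.random.default_rng(seed=0)
sample = rng.normal(10, 2, 500)

# Normal-reference rule of thumb for the bandwidth (an assumption for this sketch).
bandwidth = 1.06 * sample.std(ddof=1) * len(sample) ** (-1 / 5)

# Evaluate the estimate slightly beyond the data range so the tails are included.
grid = np.linspace(sample.min() - 4 * bandwidth, sample.max() + 4 * bandwidth, 400)
density = gaussian_kde_estimate(sample, grid, bandwidth)

# Trapezoidal rule: a density estimate should integrate to approximately 1.
area = np.sum((density[:-1] + density[1:]) / 2 * np.diff(grid))
print(f"bandwidth = {bandwidth:.3f}, area under the curve = {area:.4f}")
```

A larger bandwidth gives a smoother curve that can hide detail, while a smaller one follows the sample more closely and looks noisier, much like choosing wide or narrow histogram bins.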

Univariate Normal Distribution Plot

                    

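import plotly.figure_factory as ff

# Draw 10,000 values from a normal distribution with mean 10 and standard deviation 2
normal_distributed_variable = np.random.normal(10, 2, 10000)

# create_distplot expects a list of samples and one label per sample
fig = ff.create_distplot([normal_distributed_variable], group_labels=['Random Numbers'])

# The distplot histogram is normalised to a probability density by default, so the y-axis is a density, not a count
fig.update_layout(title={'text': 'Univariate Normal Distribution', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
                  legend=dict(yanchor="top", y=0.95, xanchor="right", x=0.95),
                  autosize=True, margin=dict(t=70, b=0, l=0, r=0), xaxis_title='Number', yaxis_title='Density',
                  font=dict(size=20, family='Times New Roman', color='brown'),
                  width=900, height=800)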

fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)

fig.show()

plotly-univariate-distribution-plot
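Note that the bars in a distribution plot are not raw counts: create_distplot normalises its histogram to a probability density by default (through its histnorm argument) so that the bars and the KDE curve share one scale. The sketch below illustrates that normalisation with NumPy only, on simulated data with an arbitrary choice of 40 bins; it is an illustration of the idea rather than plotly's internal code.

```python
import numpy as np

# Simulated data; the seed is fixed only so the output is reproducible.
rng = np.random.default_rng(seed=0)
values = rng.normal(10, 2, 10_000)

# Raw counts per bin, and the same histogram normalised to a probability density.
counts, edges = np.histogram(values, bins=40)
density, _ = np.histogram(values, bins=40, density=True)

# Density of a bin = count / (number of observations * bin width).
widths = np.diff(edges)
manual_density = counts / (counts.sum() * widths)

# The bar areas (height * width) of a density histogram add up to 1.
total_area = np.sum(density * widths)
print(f"total area of the bars: {total_area:.6f}")
```

Because the heights depend on the bin width, a density axis should be read as relative concentration of the data rather than as the number of observations.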

Univariate Exponential Distribution Plot

                    

# Draw 1,000 values from an exponential distribution with scale (mean) 0.5.
# A one-dimensional sample is used because create_distplot expects a list of 1-D arrays.
exponential_distributed_variable = np.random.exponential(scale=0.5, size=1000)

fig = ff.create_distplot([exponential_distributed_variable], group_labels=['Random Numbers'])

# The distplot histogram is normalised to a probability density by default, so the y-axis is a density, not a count
fig.update_layout(title={'text': 'Univariate Exponential Distribution', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
                  legend=dict(yanchor="top", y=0.95, xanchor="right", x=0.95),
                  autosize=True, margin=dict(t=70, b=0, l=0, r=0), xaxis_title='Number', yaxis_title='Density',
                  font=dict(size=20, family='Times New Roman', color='brown'),
                  width=900, height=800)

fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)

fig.show()

plotly-univariate-exponential-distribution-plot

Multi-Variate Distribution Plot

                    

# One sample of sepal lengths per species, with a matching label for each
dist_data = [iris_df[iris_df['class'] == 'Iris-setosa']['sepal_length'],
             iris_df[iris_df['class'] == 'Iris-versicolor']['sepal_length'],
             iris_df[iris_df['class'] == 'Iris-virginica']['sepal_length']]
group_labels = ['Iris Setosa', 'Iris Versicolor', 'Iris Virginica']

fig = ff.create_distplot(dist_data, group_labels)

fig.update_traces(opacity=0.55)
fig.update_layout(title={'text': 'Sepal Length Distribution Plot', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
                  legend=dict(title="Iris Species", yanchor="top", y=0.98, xanchor="right", x=0.95),
                  autosize=True, margin=dict(t=70, b=0, l=0, r=0), xaxis_title='Sepal Length', yaxis_title='Density',
                  font=dict(size=20, family='Times New Roman', color='brown'),
                  width=900, height=800
                 )

fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)

fig.show()

plotly-multivariate-distribution-plot

Multi-Variate Distribution Plot without Histogram

                    

# One sample of sepal lengths per species, with a matching label for each
dist_data = [iris_df[iris_df['class'] == 'Iris-setosa']['sepal_length'],
             iris_df[iris_df['class'] == 'Iris-versicolor']['sepal_length'],
             iris_df[iris_df['class'] == 'Iris-virginica']['sepal_length']]
group_labels = ['Iris Setosa', 'Iris Versicolor', 'Iris Virginica']

# show_hist=False keeps only the KDE curves and the rug plot
fig = ff.create_distplot(dist_data, group_labels, show_hist=False)

fig.update_traces(opacity=0.55)
fig.update_layout(title={'text': 'Sepal Length Distribution Plot', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
                  legend=dict(title="Iris Species", yanchor="top", y=0.98, xanchor="right", x=0.95),
                  autosize=True, margin=dict(t=70, b=0, l=0, r=0), xaxis_title='Sepal Length', yaxis_title='Density',
                  font=dict(size=20, family='Times New Roman', color='brown'),
                  width=900, height=800
                 )

fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)

fig.show()

plotly-multivariate-distribution-plot-without-histogram

For the complete code, check the Jupyter notebook here.

Conclusion

In this post we have looked at histograms and Kernel Density Estimation (KDE) and at how to create distribution plots in plotly. Distribution plots are useful for showing the frequency distribution of a continuous numerical variable. They inform us about the characteristics of the data and about which statistical methods and tests to carry out. In the next post we will look at heat maps in plotly. To learn about boxplots in plotly check our previous post here. To learn about distribution plots in seaborn check our post here.

