This post extends the post on Boxplot in matplotlib and seaborn. A boxplot, also called a box-and-whisker plot, is a type of visualization used to summarize the characteristics of groups of numerical data points. It shows important statistics such as the minimum, the maximum, the first quartile (25%), the median (second quartile, 50%) and the third quartile (75%), along with the dispersion and distribution of the dataset and any outliers. The boxplot is one of the most important types of visualization in exploratory data analysis. In this post we will look at what a boxplot is, its components, when to use it and how to create one in plotly. Download the data for this post here.
Components of a Boxplot
A boxplot summarizes the following important statistics:
- Minimum Value. The lowest value in the dataset, excluding outliers.
- Maximum Value. The largest value in the dataset, excluding outliers.
- Lower Quartile (Q1). The 25th percentile of the dataset, also referred to as the first quartile.
- Median (Q2). Also referred to as the second quartile, it is the mid-point of the dataset. It is represented by a line that cuts the box into two parts.
- Upper Quartile (Q3). Also referred to as the third quartile, it is the 75th percentile of the dataset.
- Whiskers. Straight lines extending from each end of the box. They represent the values outside the box: the lower whisker spans roughly the lowest 25% of the data and the upper whisker roughly the highest 25%, excluding outliers.
- Interquartile Range (IQR). Represented by the box itself, it is the range between the 25th and 75th percentiles and contains the middle 50% of the data points.
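To make the components above concrete, here is a minimal sketch that computes them by hand with NumPy. The sample values are hypothetical, and the 1.5 × IQR rule for placing the whiskers is a common convention that this sketch assumes. Plotting libraries may use a slightly different quartile method, so the numbers they draw can differ a little from these.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration
rates = np.array([0.71, 0.74, 0.75, 0.78, 0.80, 0.82, 0.85, 0.88, 0.90, 1.45])

# Quartiles (NumPy's default linear interpolation) and the interquartile range
q1, median, q3 = np.percentile(rates, [25, 50, 75])
iqr = q3 - q1

# Fences under the assumed 1.5 * IQR convention
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations that lie inside the fences
inside = rates[(rates >= lower_fence) & (rates <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

print(f"Q1={q1:.4f}, median={median:.4f}, Q3={q3:.4f}, IQR={iqr:.4f}")
print(f"whiskers run from {whisker_low} to {whisker_high}")
```

In this sample the value 1.45 lies above the upper fence, so it would be drawn as a separate outlier point rather than extending the upper whisker.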
When to Use Boxplot
- Measure of median. The straight line across the box denotes the median. Unlike the mean, the median is not sensitive to outliers.
- Detect outliers. Boxplots are integral in showing outliers in the dataset. Outliers are data points that fall outside the whiskers.
- Measure of dispersion. A boxplot shows the variability of the dataset, including the range between the lowest and highest values and the interquartile range.
- Show the distribution of the dataset. We can see how the data is distributed with a boxplot. A roughly symmetric distribution, such as normally distributed data, is suggested when the median line divides the box into two equal halves. When the median is closer to the lower quartile the data is positively skewed; when it is closer to the upper quartile the data is negatively skewed.
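The outlier and skewness checks described above can also be done numerically. The sketch below uses hypothetical values, flags outliers with the commonly used 1.5 × IQR rule (an assumption, not something specific to this post's dataset), and measures skew with the quartile-based (Bowley) coefficient, which is positive when the median sits closer to the lower quartile.

```python
import pandas as pd

# Hypothetical right-skewed values with one deliberately extreme point
values = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 6.0, 9.0, 14.0, 40.0])

q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond the 1.5 * IQR fences are treated as outliers
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Bowley skewness: compares the upper half of the box with the lower half
bowley = ((q3 - median) - (median - q1)) / iqr

print("outliers:", outliers.tolist())
print("positively skewed" if bowley > 0 else "negatively skewed or symmetric")
```

Because this coefficient only uses the quartiles, it reflects the position of the median line inside the box and is not affected by the extreme value.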
Boxplot in Plotly
Import Required Libraries
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
Load Data
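# Load the exchange-rate data and add friendlier column names
ex_rate_df = pd.read_csv('Exchange_Rates.csv')
ex_rate_df['Country'] = ex_rate_df['LOCATION']
ex_rate_df['Year'] = ex_rate_df['TIME']
# Round the rate to two decimal places
ex_rate_df['Rate'] = ex_rate_df['Value'].round(2)
ex_rate_df.head()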
Simple Boxplot
fig = px.box(ex_rate_df[ex_rate_df['Country'].isin(['FRA'])], y="Rate")
fig.update_layout(
    title={'text': 'Exchange Rate for FRA', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
    legend=dict(yanchor="top", y=0.95, xanchor="right", x=0.95),
    autosize=True, margin=dict(t=70, b=0, l=0, r=0), xaxis_title='Currency', yaxis_title='Rate',
    font=dict(size=20, family='Times New Roman', color='brown'))
fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.show()
Multi-Variable Boxplot
fig = px.box(ex_rate_df[ex_rate_df['Country'].isin(['FRA','CAN','AUS'])], y="Rate", x='Country', color='Country')
fig.update_layout(
    title={'text': 'Exchange Rate for FRA, CAN and AUS', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
    legend=dict(yanchor="top", y=0.95, xanchor="right", x=0.95),
    autosize=True, margin=dict(t=70, b=0, l=0, r=0), xaxis_title='Country', yaxis_title='Rate',
    font=dict(size=20, family='Times New Roman', color='brown'))
fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.show()
Multi-Variable Boxplot with Actual Data Points
fig = px.box(ex_rate_df[ex_rate_df['Country'].isin(['FRA','CAN','AUS'])], y="Rate", x='Country', color='Country',points='all')
fig.update_layout(
    title={'text': 'Exchange Rate for FRA, CAN and AUS', 'y': 0.95, 'x': 0.5, 'xanchor': 'center', 'yanchor': 'top'},
    legend=dict(yanchor="top", y=0.95, xanchor="right", x=0.95),
    autosize=True, margin=dict(t=70, b=0, l=0, r=0), xaxis_title='Country', yaxis_title='Rate',
    font=dict(size=20, family='Times New Roman', color='brown'))
fig.update_xaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.update_yaxes(showline=True, linewidth=1, linecolor='white', gridwidth=3, gridcolor='white', mirror=True)
fig.show()
For the complete code, check the Jupyter notebook here.
Conclusion
In this post we have looked at the boxplot. Boxplots are data visualizations that provide a high-level summary of the numerical data points in a dataset. They let us see the minimum and maximum values, the quartiles, the range, the interquartile range, the median and the distribution of the data. An important use of the boxplot in data science and machine learning is detecting outliers in the dataset. In the next post we will look at distribution plots and how to create them in plotly. To learn about the area chart and how to create it in plotly, check our previous post here. To learn about Boxplot in seaborn, check our post here.