Outliers impact the quality of data. Outliers are data points that lie far away from the majority of the data points. They may arise from measurement error or from natural variability in the population distribution. Some statistical measures such as the mean are highly sensitive to outliers, while others like the median are robust to their presence. Outliers skew the data and lead to wrong analyses, misleading insights, and poor models, among other problems. Handling outliers is an important task that involves using domain knowledge to understand their source and statistical techniques to detect them. In regression problems we commonly evaluate models with Root Mean Squared Error (RMSE), which heavily penalizes large errors and is therefore especially sensitive to outliers in the dataset. In this post we will look at how to detect and handle outliers in data. Download the data for this post here.

outlier-detection-logo

Image from https://revenue-hub.com/anomaly-detection-inventory-issues-revenue-leakage/

Effects of Outliers in Data

  1. Reduces data quality. Outliers due to measurement error imply poor-quality data.
  2. Skews the mean of the data.
  3. Leads to wrong analyses and misleading insights.

Ways to Detect Outliers

There are two broad ways of detecting outliers in data: by use of domain knowledge and by use of statistical techniques. Below are techniques for detecting outliers in data.

  1. Domain Knowledge. The most common way to spot outliers is through domain experience. A doctor taking a patient's blood pressure has an idea of the range of normal human blood pressure based on past experience and standard reference values, so readings far outside that range can be flagged immediately (see the sketch after this list).
  2. Z-score. The Z-score approach, also referred to as the standard deviation approach, is used to detect outliers in data that follows a normal distribution. The Z-score of a value is its distance from the mean divided by the standard deviation. For normally distributed data, any point more than 3 standard deviations from the mean (|Z-score| > 3) is considered an outlier. In the standard normal distribution the data has a mean of 0 and a standard deviation of 1, and the 68%-95%-99.7% rule applies: about 68% of the data falls within one standard deviation of the mean, 95% within two, and 99.7% within three. The decision we make based on this approach is to discard data beyond the 3rd standard deviation, i.e. with an absolute Z-score greater than 3.
  3. Interquartile Range. The interquartile range is the difference between the third and first quartiles, IQR = Q3 - Q1, and covers the 50% of the dataset that falls in the middle. In statistics, Q1 is the value below which 25% of the data falls, Q2 (the median) marks 50%, and Q3 marks 75%. Data points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR are considered outliers. The 1.5 constant is an arbitrary but commonly used value; depending on the distribution of the data and the use-case it can vary.
  4. Boxplot. We can graphically detect outliers with a boxplot. A boxplot is a tool that visually summarizes the distribution of the data by showing the minimum, the first quartile Q1, the median Q2, the third quartile Q3, the maximum, and outlier values (data points below Q1 - 1.5*IQR or above Q3 + 1.5*IQR). To learn more about boxplots check out the post here.
  5. Scatter plot. A scatter plot is a graphical representation of a continuous variable. It shows how individual values are distributed, which is useful for detecting outliers in a dataset. To learn more about scatter plots check our previous post here.
  6. Histogram. A histogram is used to visualize the frequency distribution of a random variable. The values are represented as a series of rectangular bars grouped into bins/classes, and the height of each bar denotes the frequency of data in that bin. A histogram visually exposes outlier data points. To learn more about histograms check the post here.
  7. Clustering techniques. Clustering machine learning approaches such as DBSCAN and K-Means can be used to detect outliers in a dataset. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) finds core samples of high density and expands clusters from them; rather than specifying the number of clusters, we specify the neighborhood radius (eps) and the minimum number of samples per cluster, and points that belong to no cluster are labelled as noise. It works well for data which contains clusters of similar density.
  8. Isolation Forest. Isolation Forest is a tree-based machine learning algorithm used for anomaly detection. The IsolationForest ‘isolates’ observations by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. Because anomalies are few and different, they tend to be isolated in fewer random splits than normal points.
  9. Local Outlier Factor. The Local Outlier Factor uses the nearest neighbors technique to identify outlier data points in a dataset. This approach is ideal for data in a low dimensional space. It measures the local deviation of the density of a given sample with respect to its neighbors.
  10. Minimum Covariance Determinant (MCD). The Minimum Covariance Determinant (MCD) estimator is an affine equivariant estimator of multivariate location and scatter. It is resistant to outliers, which makes it an ideal method for detecting outliers in a dataset.
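To make the domain-knowledge idea concrete, here is a minimal sketch of a rule-based check. The blood pressure readings and range bounds below are hypothetical illustrations, not clinical guidance.

import pandas as pd

# Hypothetical systolic blood pressure readings in mmHg (illustrative values only)
bp_readings = pd.Series([118, 122, 135, 400, 110, 95])

# Bounds chosen from domain knowledge; illustrative, not clinical guidance
LOWER_PLAUSIBLE, UPPER_PLAUSIBLE = 50, 250

# Flag readings outside the plausible physiological range
bp_outliers = bp_readings[(bp_readings < LOWER_PLAUSIBLE) | (bp_readings > UPPER_PLAUSIBLE)]
print(bp_outliers)  # the reading of 400 is flagged as an outlier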

Outlier Detection Techniques Implementation

Load Data

                    

import pandas as pd
import numpy as np

continental_temperature_df=pd.read_csv('continental_temperature.csv')
continental_temperature_df.head()

outlier-detection-load-data
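Before diving into the individual techniques, a quick look at summary statistics can hint at outliers; an unusually extreme minimum or maximum relative to the quartiles is often the first clue.

continental_temperature_df['AvgTemperature'].describe() # a min or max far from the quartiles hints at outliers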

 

1. Z-score for outlier detection

Note that here we filter outliers using a stricter Z-score threshold of 2 (rather than 3), flagging points with a Z-score greater than 2 or less than -2.

                    

from scipy import stats

z_score_temperature=stats.zscore(continental_temperature_df['AvgTemperature'])
continental_temperature_df['Z-Score']=z_score_temperature # create new column for z-score
continental_temperature_df[(continental_temperature_df['Z-Score']<-2) | (continental_temperature_df['Z-Score']>2)] # filter samples with Z-score < -2 or > 2

outlier-detection-z-score
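Detection is only half the task; handling often means dropping the flagged rows. A minimal sketch, keeping the Z-score threshold of 2 used above:

clean_temperature_df=continental_temperature_df[continental_temperature_df['Z-Score'].abs()<=2] # keep only rows within the threshold
clean_temperature_df.shape # fewer rows than the original dataframe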

2. Interquartile Range

i. Interquartile Range using scipy iqr
                    

iqr=stats.iqr(continental_temperature_df['AvgTemperature'],interpolation='midpoint')
iqr

outlier-detection-scipy-iqr-output

ii. Interquartile Range using percentile function

                    

Q1=np.percentile(continental_temperature_df['AvgTemperature'], 25, interpolation='midpoint')
Q3=np.percentile(continental_temperature_df['AvgTemperature'], 75, interpolation='midpoint')

iqr=Q3-Q1
print("Q1 : ",Q1,"\nQ3 : ",Q3,"\nIQR : ",iqr)

outlier-detection-iqr-percentile-results
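The IQR value alone does not flag anything; we still need the 1.5*IQR fences described earlier. A minimal sketch reusing Q1, Q3 and iqr from above:

lower_fence=Q1-1.5*iqr # data points below this fence are outliers
upper_fence=Q3+1.5*iqr # data points above this fence are outliers

continental_temperature_df[(continental_temperature_df['AvgTemperature']<lower_fence) | (continental_temperature_df['AvgTemperature']>upper_fence)] # filter outlier samples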

3. Boxplot

                    

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(rc={'figure.figsize':(20,10)}) # Set figure size 

plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
plt.rcParams['text.color'] = 'blue'
plt.rcParams['font.size'] = 20

sns.boxplot(x=continental_temperature_df['AvgTemperature'])
plt.show()

outlier-detection-boxplot

4. Scatter plot

                    

plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
sns.scatterplot(x="Year", y="AvgTemperature", linewidth=0, data=continental_temperature_df)
plt.show()

outlier-detection-scatterplot

5. Histogram

Histogram with Kernel Density Estimation

                    

plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
plt.rcParams['text.color'] = 'blue'
plt.rcParams['font.size'] = 20

sns.histplot(continental_temperature_df, x='AvgTemperature',kde=True)
plt.show()

outlier-detection-histogram

6. DBSCAN for Outlier Detection

An output of -1 implies that the data point is an outlier (DBSCAN labels noise points, which belong to no cluster, as -1).

                    

from sklearn.cluster import DBSCAN

data=continental_temperature_df['AvgTemperature'].to_numpy().reshape(-1, 1)
dbscan=DBSCAN(min_samples=2, eps=3)
clusters=dbscan.fit_predict(data)
continental_temperature_df['dbscan_outliers']=clusters
continental_temperature_df[continental_temperature_df['dbscan_outliers']<0] # Show outlier data points (noise labelled -1)

outlier-detection-dbscan
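Note that eps is expressed in the units of the feature itself, so the results change if the data is rescaled. A sketch of the same approach on standardized data; the eps value of 0.5 here is an illustrative guess, not a tuned setting:

from sklearn.preprocessing import StandardScaler

scaled_data=StandardScaler().fit_transform(data) # rescale to zero mean and unit variance
scaled_clusters=DBSCAN(min_samples=2, eps=0.5).fit_predict(scaled_data)
(scaled_clusters==-1).sum() # number of points labelled as noise/outliers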

7. Isolation Forest

The sklearn.ensemble module provides the IsolationForest class, which returns the anomaly label of each sample using the Isolation Forest algorithm. The contamination parameter determines the expected amount of contamination of the data set, i.e. the proportion of outliers in the data, and is used when fitting to define the threshold on the scores of the samples. An output of -1 implies that the data point is an outlier.

                    

from sklearn.ensemble import IsolationForest

data=continental_temperature_df['AvgTemperature'].to_numpy().reshape(-1, 1)
isf=IsolationForest()
isf_labels=isf.fit_predict(data) # -1 for outliers, 1 for normal points

continental_temperature_df['isolation_forest_outliers']=isf_labels
continental_temperature_df[continental_temperature_df['isolation_forest_outliers']==-1] # Show outlier data points

outlier-detection-isolation-forest
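The code above relies on the default contamination setting. A sketch of setting it explicitly; the 1% figure is an illustrative assumption about the proportion of outliers, not a property of this dataset:

isf_tuned=IsolationForest(contamination=0.01, random_state=42) # assume ~1% outliers; random_state makes the result reproducible
tuned_labels=isf_tuned.fit_predict(data)
(tuned_labels==-1).sum() # roughly 1% of the rows get flagged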

8. Local Outlier Factor

The sklearn.neighbors module provides the LocalOutlierFactor class, an unsupervised outlier detection technique. It measures the local deviation of the density of a given sample with respect to its neighbors; it is local in that the anomaly score depends on how isolated the object is with respect to the surrounding neighborhood. The fit_predict method returns -1 for an outlier data point and 1 for a normal data point.

                    

from sklearn.neighbors import LocalOutlierFactor

data=continental_temperature_df['AvgTemperature'].to_numpy().reshape(-1, 1)
lof=LocalOutlierFactor(n_neighbors=2)
lof_model=lof.fit_predict(data)

continental_temperature_df['lof_outliers']=lof_model
continental_temperature_df[continental_temperature_df['lof_outliers']==-1] # Show outlier data points

outlier-detection-local-outlier-factor
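Beyond the -1/1 labels, the fitted LocalOutlierFactor also exposes the raw scores through its negative_outlier_factor_ attribute; the more negative the score, the more anomalous the sample. A minimal sketch:

continental_temperature_df['lof_score']=lof.negative_outlier_factor_ # more negative means more anomalous
continental_temperature_df.sort_values('lof_score').head() # the most anomalous samples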

9. Minimum Covariance Determinant

The sklearn.covariance module provides the EllipticEnvelope class for detecting outliers in Gaussian-distributed data, fitting a robust (Minimum Covariance Determinant) estimate of location and covariance. Its fit_predict method returns -1 for an outlier data point and 1 for a normal data point.

                    

from sklearn.covariance import EllipticEnvelope

data=continental_temperature_df['AvgTemperature'].to_numpy().reshape(-1, 1)
ee=EllipticEnvelope()
ee_model=ee.fit_predict(data)

continental_temperature_df['ellipticenvelope_outliers']=ee_model
continental_temperature_df[continental_temperature_df['ellipticenvelope_outliers']==-1] # Show outlier data points

outlier-detection-minimum-covariance-determinant
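With every detector's flags now stored in the dataframe, a simple way to compare them is to count, per row, how many methods agree. A minimal sketch; note that DBSCAN also uses -1 for its noise points, so the same comparison works across all four columns:

flag_columns=['dbscan_outliers','isolation_forest_outliers','lof_outliers','ellipticenvelope_outliers']
continental_temperature_df['outlier_votes']=(continental_temperature_df[flag_columns]==-1).sum(axis=1) # number of detectors flagging each row
continental_temperature_df[continental_temperature_df['outlier_votes']>=3] # points most detectors agree on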

For the complete code check the notebook here.

Conclusion

Outliers can result in misleading analyses. We have to detect outliers and handle them to improve the quality of our data and get better analytical and modelling results. Most machine learning models are sensitive to outliers, hence the need to check for outliers in our data before modelling, or alternatively to use models that are resistant to outliers. In this post we have looked at various techniques to detect outliers and how to implement them. The choice of approach depends on the type of data and the business use-case; a rule of thumb is to experiment with several approaches and select the one with the best performance. In the previous post we learnt about handling imbalanced data, check the post here. In the next post we will learn about sampling and various sampling techniques.
