In machine learning classification problems, an imbalanced dataset affects the performance of the model. Data is imbalanced when the classes in the target variable have unequal distributions: if the target variable has two classes A and B, and class A accounts for 90% of the samples while class B accounts for 10%, the data is imbalanced. This affects most machine learning algorithms, since they are not inherently robust to class imbalance, although some algorithms, such as decision trees, can cope with it reasonably well. There are various techniques for handling imbalanced data, and choosing among them requires experimenting with several techniques to find the one that yields the best performance for your model.

imbalanced-data-image

Imbalanced Data Problems

  1. Fraud Detection
  2. Disease Diagnosis
  3. Occurrence of Natural Disaster
  4. Customer Churn
  5. Spam Mail Detection
  6. … and many more examples

Handling Imbalanced Data

  1. Selecting a suitable model evaluation metric. In classification problems we evaluate our models with various metrics such as accuracy, precision, recall and F1-Score. The choice of a suitable metric is key to correctly evaluating the model. Accuracy is one of the most commonly used metrics, but it only works well when the data is balanced; for imbalanced data, accuracy gives a misleading picture, so we need other metrics such as precision or recall (a short sketch after this list illustrates why). However, even these metrics become less informative when the data is highly imbalanced, in which case we need to try other approaches.
  2. Collecting more data. Where feasible, collecting more data increases the number of minority-class samples.
  3. Using a suitable model. Not all machine learning models are equally affected by imbalanced data; algorithms such as XGBoost can handle imbalanced data fairly efficiently.
  4. Using K-fold cross-validation. Cross-validation repeatedly splits the data into randomly sampled train and validation folds, so the performance estimate does not depend on a single, possibly unrepresentative, split.
  5. Over-sampling. The most common method for dealing with imbalanced data is oversampling. Oversampling is appropriate when we have few samples and therefore don't want to discard any data. In oversampling we can use any of the techniques below;
    1. Simple random oversampling. This involves repeating minority-class samples, creating duplicates until the minority class is the same size as the majority class.
    2. Oversampling by shrinkage. Instead of repeating and duplicating minority samples, we create new samples by adding noise to the minority-class samples.
    3. Oversampling using the Synthetic Minority Over-Sampling Technique (SMOTE). SMOTE increases the minority samples by creating synthetic samples with a nearest neighbors method. This ensures that the oversampled data contains useful information rather than mere duplicates.
  6. Under-sampling. This involves downsizing the majority class until it is the same size as the minority class. We have the following three approaches for under-sampling imbalanced data;
    1. Simple random undersampling. This technique discards samples from the majority class until it equals the minority class. The disadvantage of this approach is that it loses the information carried by the discarded samples.
    2. Undersampling with K-Means. This under-samples the majority class by replacing clusters of majority samples with their cluster centroids: a K-Means model with N clusters is fitted to the majority class, and the coordinates of the N centroids become the new N majority samples.
    3. Undersampling with Tomek links. This technique removes samples that form Tomek links. A Tomek link occurs when two samples from different classes are each other's nearest neighbors. With Tomek links we remove the majority samples that are close to the minority class.
  7. Combining oversampling and undersampling. This approach lets us leverage the strengths of both oversampling and undersampling to achieve a balanced dataset.
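
To make point 1 concrete, here is a minimal sketch (assuming scikit-learn is installed) using made-up labels: a naive classifier that always predicts the majority class scores 90% accuracy on a 90/10 split, yet its precision, recall and F1-Score for the minority class are all zero, which is why accuracy alone can be misleading.

import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical labels: 90 majority-class (0) and 10 minority-class (1) samples
y_true = np.array([0]*90 + [1]*10)
# A naive model that always predicts the majority class
y_pred = np.zeros(100, dtype=int)

print('Accuracy :', accuracy_score(y_true, y_pred))                    # 0.90
print('Precision:', precision_score(y_true, y_pred, zero_division=0))  # 0.00
print('Recall   :', recall_score(y_true, y_pred))                      # 0.00
print('F1-Score :', f1_score(y_true, y_pred, zero_division=0))         # 0.00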

Load Data

                    

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

titanic_df=pd.read_csv('titanic.csv')
titanic_df.head()

imbalanced-data-titanic-dataframe

Check for Class Imbalance in Data

                    

titanic_df['Survived'].value_counts().plot(kind='bar',color=['skyblue','orange'])

imbalanced-data-check-class-imbalance

Handling Imbalanced Data

1. K-Fold Cross-Validation

Sklearn has a KFold(n_splits=5, *, shuffle=False, random_state=None) class that splits the dataset into k consecutive folds (without shuffling by default). The n_splits parameter sets the number of folds.
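
As a quick illustration of what KFold produces, the minimal sketch below (using a made-up array of six samples) prints the train and test index arrays generated for each fold:

import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(6).reshape(6, 1)  # six toy samples
kf = KFold(n_splits=3, shuffle=True, random_state=1)
for fold, (train_idx, test_idx) in enumerate(kf.split(X_demo), start=1):
    print('Fold %d | train indices: %s | test indices: %s' % (fold, train_idx, test_idx))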

 

Let's fit a logistic regression model on the Titanic data with KFold cross-validation

                    

# import required libraries
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
#Get features and target variables from data
X=titanic_df[['Pclass','Age','Siblings/Spouses Aboard','Parents/Children Aboard','Fare']]
y=titanic_df['Survived']
# prepare cross-validation data with 10 folds
cv=KFold(n_splits=10, shuffle=True, random_state=1)
# model
model=LogisticRegression()
#Evaluate model with cv
score=cross_val_score(model,X,y,scoring='accuracy',cv=cv)
print('Min Accuracy %.2f'% (score.min()*100),'%')
print('Max Accuracy %.2f'% (score.max()*100),'%')
print('Mean Accuracy %.2f'% (np.mean(score)*100),'%')

imbalanced-data-kfold-accuracy-results

Let's determine the value of k (number of folds) that gives the highest accuracy

                    

def model_evaluation(cv,X,y):
    model=LogisticRegression()
    score=cross_val_score(model,X,y,scoring='accuracy',cv=cv)
    return score.min(), score.max(),np.mean(score)
    
folds=range(2,21)
min_cv,max_cv,mean_cv=list(),list(),list()
#iterate through each value of fold/k
for k in folds:
    cv=KFold(n_splits=k,shuffle=True,random_state=1)
    k_min,k_max,k_mean=model_evaluation(cv,X,y)
    min_cv.append(k_min*100)
    max_cv.append(k_max*100)
    mean_cv.append(k_mean*100)
    print('Folds=%d || Min Acc. %.2f || Max Acc. %.2f || Mean Acc. %.2f' % (k,k_min*100,k_max*100,k_mean*100))

imbalanced-data-kfold-k-value

Plot the accuracy scores for each value of k

                    

import matplotlib.pyplot as plt
plt.style.use('ggplot') # Other styles to use: fivethirtyeight

plt.figure(figsize=(20,10)) # Set figure size
plt.rcParams.update({'font.size': 22}) # Set the default font size
plt.plot(folds,min_cv,color='orange',marker='*') # Plot the minimum accuracy
plt.plot(folds,max_cv,color='skyblue',marker='*') # Plot the maximum accuracy
plt.plot(folds,mean_cv,color='green',marker='*') # Plot the mean accuracy
plt.title('K-Fold',fontsize=24)
plt.xticks(rotation=45)
plt.xlabel('K',fontsize=24)
plt.ylabel('Accuracy',fontsize=24)
plt.legend(['Min','Max','Mean'], loc='upper right')
plt.show()

imbalanced-data-kfold-k-value-plot

2. Oversampling

i. Simple random oversampling

We will use imbalanced-learn, which is a Python package for data re-sampling. To install imbalanced-learn, open a terminal and run the command below

                    

pip install imbalanced-learn

When using Anaconda, run the command below in the Anaconda Prompt

                    

conda install -c conda-forge imbalanced-learn

                    

from imblearn.over_sampling import RandomOverSampler

ros=RandomOverSampler()
X_resampled,y_resampled=ros.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])

imbalanced-data-simple-random-oversampling

ii. Oversampling by shrinkage

We use the imbalanced-learn library, which comes with the RandomOverSampler class, but in this case we add the shrinkage parameter

                    

from imblearn.over_sampling import RandomOverSampler

ros=RandomOverSampler(shrinkage=0.15)
X_resampled,y_resampled=ros.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])

imbalanced-data-simple-random-oversampling-by-shrinkage

iii. Oversampling with SMOTE

The imbalanced-learn package comes with a SMOTE class for synthetic oversampling

                    

from imblearn.over_sampling import SMOTE

smote=SMOTE()
X_resampled,y_resampled=smote.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])

imbalanced-data-smote-oversampling

3. Undersampling

i. Simple random undersampling

We use imbalanced-learn, which comes with the RandomUnderSampler class for undersampling data

                    

from imblearn.under_sampling import RandomUnderSampler

rus=RandomUnderSampler()
X_resampled,y_resampled=rus.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])

imbalanced-data-simple-random-undersampling

ii. Undersampling with K-Means

The imbalanced-learn library contains the ClusterCentroids class, which clusters the majority class and uses the cluster centroids for undersampling

                    

from imblearn.under_sampling import ClusterCentroids

cc=ClusterCentroids()
X_resampled,y_resampled=cc.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])

imbalanced-data-undersampling-with-kmeans

iii. Undersampling with Tomek links

The imbalanced-learn library comes with the TomekLinks class, which we use for undersampling. Note that the resulting class counts are not equal, because Tomek links only remove majority-class samples that are close to the minority class.

                    

from imblearn.under_sampling import TomekLinks

tl=TomekLinks()
X_resampled,y_resampled=tl.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])

imbalanced-data-undersampling-with-tomek

4. Combining Oversampling and Undersampling

For a more robust way of dealing with imbalanced data we can combine the oversampling and undersampling techniques. The imbalanced-learn library comes with the SMOTETomek class, which performs SMOTE oversampling followed by Tomek-link undersampling.

                    

from imblearn.combine import SMOTETomek

st=SMOTETomek()
X_resampled,y_resampled=st.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])

imbalanced-data-smotetomek
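
As a rough sketch of how the techniques above can be compared experimentally (this is not part of the original notebook), one common pattern is to split the data first, resample only the training set, and then score on the untouched test set. SMOTETomek below is interchangeable with any of the samplers shown earlier, and X and y are the feature matrix and target defined in the K-Fold section.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from imblearn.combine import SMOTETomek

# Hold out a test set that keeps the original class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1, stratify=y)
# Resample the training portion only
X_train_res, y_train_res = SMOTETomek().fit_resample(X_train, y_train)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_res, y_train_res)
print(classification_report(y_test, model.predict(X_test)))

Resampling only the training portion keeps synthetic and duplicated samples out of the test set, so the reported precision, recall and F1-Score reflect performance on the original class distribution.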

For the complete code, check the notebook here.

Conclusion

Imbalanced data can greatly affect machine learning models in classification problems and can lead to poor or misleading performance. However, not all machine learning algorithms are sensitive to imbalanced data; algorithms such as XGBoost handle imbalanced data fairly well internally. In this post we have looked at various techniques for dealing with imbalanced data, such as resampling, selecting a suitable evaluation metric and increasing the data size, among others. These approaches introduce you to the common techniques used for handling imbalanced data. Selecting a suitable approach comes down to experimentally evaluating each technique and choosing the one that yields the best performance. In the previous post we learnt how to handle missing values; check the post here. In the next post we will learn how to detect and handle outliers in a dataset.
