In machine learning classification problems, an imbalanced dataset affects the performance of the model. Data is imbalanced when the classes in the target variable have unequal distributions in the dataset. For example, if the target variable has two classes A and B, and 90% of the samples belong to class A while only 10% belong to class B, then the data is imbalanced. This affects most machine learning algorithms since they are not resistant to imbalanced data, although some algorithms, such as decision trees, can cope with a degree of imbalance. There are various techniques for handling imbalanced data, and choosing between them requires experimenting with several of them to arrive at the one that yields optimum performance for the model.
Imbalanced Data Problems
- Fraud Detection
- Disease Diagnosis
- Occurrence of Natural Disaster
- Customer Churn
- Spam Mail Detection
- … and many more examples
Handling Imbalanced Data
- Selecting correct model evaluation metrics. In classification problems we evaluate our models with various metrics such as accuracy, precision, recall, F1-score, etc. The choice of a suitable metric is key to correctly evaluating the model. Accuracy is one of the most commonly used metrics, but it only works well when the data is balanced. On imbalanced data, accuracy gives a misleading picture, since a model that always predicts the majority class still scores high, so we need other metrics such as precision, recall or F1-score (see the sketch after this list). However, even these metrics become less informative when the data is highly imbalanced, in which case we need to try other approaches.
- Collecting more data. If feasible, collecting more data will increase the number of samples in the minority class.
- Using a suitable model. Not all machine learning models are equally affected by imbalanced data. Algorithms such as XGBoost can be configured to compensate for class imbalance, for example through class weighting, and therefore handle imbalanced data more gracefully (see the class-weighting sketch after this list).
- Using K-fold cross-validation. By applying cross-validation we repeatedly split the data into randomly sampled train and validation folds, which gives a more reliable estimate of model performance than a single split.
- Over-sampling. The most common method for dealing with imbalanced data is oversampling. Oversampling is appropriate when we have few samples and we don't want to discard any data. In oversampling we can use any of the techniques below:
- Simple random oversampling. This involves repeating minority class samples, creating duplicates until the minority class is the same size as the majority class.
- Oversampling by shrinkage. Instead of exactly duplicating minority samples, we create new samples by adding a small amount of noise around the existing minority class samples.
- Oversampling using Synthetic Minority Over-Sampling Technique (SMOTE). SMOTE increases the minority samples by creating synthetic samples from the nearest neighbours of existing minority samples. This ensures that the oversampled data contains useful information rather than mere duplicates.
- Under-sampling. This involves downsizing the majority class until it is equal in size to the minority class. We have the following three approaches for under-sampling imbalanced data:
- Simple random undersampling. This technique discards samples from the majority class at random until they equal the minority class samples. The disadvantage of this approach is that it loses the information contained in the discarded samples.
- Using K-Means for undersampling. This method undersamples the majority class by replacing clusters of majority samples with the cluster centroids of a KMeans algorithm. It keeps N majority samples by fitting KMeans with N clusters to the majority class and using the coordinates of the N cluster centroids as the new majority samples.
- Undersampling with Tomek links. This technique removes samples that form Tomek links. A Tomek link is a pair of samples from different classes that are each other's nearest neighbours, i.e. two samples of different classes that are close together. With Tomek links we remove the majority samples that are close to the minority class.
- Combining Oversampling and Undersampling. This approach allows us to leverage the strengths of Oversampling and Undersampling to achieve a balanced dataset.
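As a minimal sketch of the metric-selection point above (assuming y_test and y_pred already hold the true labels and a model's predictions, which are not defined elsewhere in this post), scikit-learn's classification_report prints per-class precision, recall and F1-score, which is far more informative than plain accuracy on imbalanced data:
# Sketch only: y_test and y_pred are assumed to come from a previously trained model
from sklearn.metrics import accuracy_score, classification_report
print('Accuracy %.2f' % (accuracy_score(y_test, y_pred)*100), '%')  # misleading on imbalanced data
print(classification_report(y_test, y_pred))  # per-class precision, recall and F1-score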
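For the point on choosing a suitable model, a common alternative to resampling is to weight the classes. The snippet below is only a sketch (it assumes y is the binary target defined later in this post and that the xgboost package is installed): scikit-learn's LogisticRegression accepts class_weight='balanced', and XGBoost's XGBClassifier exposes scale_pos_weight, often set to the majority/minority count ratio.
# Sketch only: assumes y is the binary target (defined later in the post) and xgboost is installed
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
lr = LogisticRegression(class_weight='balanced')  # weight classes inversely to their frequency
ratio = (y == 0).sum() / (y == 1).sum()           # majority count / minority count
xgb = XGBClassifier(scale_pos_weight=ratio)       # upweight the positive (minority) class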
Load Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
titanic_df=pd.read_csv('titanic.csv')
titanic_df.head()
Check for Class Imbalance in Data
titanic_df['Survived'].value_counts().plot(kind='bar',color=['skyblue','orange'])
Handling Imbalanced Data
1. K-Fold Cross-Validation
Sklearn has a KFold(n_splits=5, *, shuffle=False, random_state=None) class that splits the dataset into k consecutive folds (without shuffling by default). The n_splits parameter sets the number of folds.
Let’s fit a logistic regression model on the Titanic data with KFold cross-validation.
# import required libraries
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
#Get features and target variables from data
X=titanic_df[['Pclass','Age','Siblings/Spouses Aboard','Parents/Children Aboard','Fare']]
y=titanic_df['Survived']
# prepare cross-validation data with 10 folds
cv=KFold(n_splits=10, shuffle=True, random_state=1)
# model
model=LogisticRegression()
#Evaluate model with cv
score=cross_val_score(model,X,y,scoring='accuracy',cv=cv)
print('Min Accuracy %.2f'% (score.min()*100),'%')
print('Max Accuracy %.2f'% (score.max()*100),'%')
print('Mean Accuracy %.2f'% (np.mean(score)*100),'%')
Let’s determine the value of k (number of folds) that gives the highest accuracy.
# helper that fits logistic regression and returns min, max and mean accuracy for a given CV splitter
def model_evaluation(cv, X, y):
    model = LogisticRegression()
    score = cross_val_score(model, X, y, scoring='accuracy', cv=cv)
    return score.min(), score.max(), np.mean(score)

folds = range(2, 21)
min_cv, max_cv, mean_cv = list(), list(), list()
# iterate through each value of fold/k
for k in folds:
    cv = KFold(n_splits=k, shuffle=True, random_state=1)
    k_min, k_max, k_mean = model_evaluation(cv, X, y)
    min_cv.append(k_min*100), max_cv.append(k_max*100), mean_cv.append(k_mean*100)
    print('Folds=%d || Min Acc. %.2f || Max Acc. %.2f || Mean Acc. %.2f' % (k, k_min*100, k_max*100, k_mean*100))
Plot k values
import matplotlib.pyplot as plt
plt.style.use('ggplot') # Other styles to use: fivethirtyeight
plt.figure(figsize=(20,10)) # Set figure size
plt.rcParams.update({'font.size': 22}) # Set font size
plt.plot(folds,min_cv,color='orange',marker='*') # Plot the minimum accuracy
plt.plot(folds,max_cv,color='skyblue',marker='*') # Plot the maximum accuracy
plt.plot(folds,mean_cv,color='green',marker='*') # Plot the mean accuracy
plt.title('K-Fold',fontsize=24)
plt.xticks(rotation=45)
plt.xlabel('K',fontsize=24)
plt.ylabel('Accuracy',fontsize=24)
plt.legend(['Min','Max','Mean'], loc='upper right')
plt.show()
2. Oversampling
We will use imbalanced-learn, which is a Python package for data re-sampling. To install imbalanced-learn, open a terminal and run the command below:
pip install imbalanced-learn
When using Anaconda, run the command below in the Anaconda prompt:
conda install -c conda-forge imbalanced-learn
i. Simple random oversampling
The imbalanced-learn library comes with a RandomOverSampler class that duplicates minority samples at random until the classes are balanced.
from imblearn.over_sampling import RandomOverSampler
ros=RandomOverSampler()
X_resampled,y_resampled=ros.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])
ii. Oversampling by shrinkage
We again use the RandomOverSampler class from imbalanced-learn, but in this case we add the shrinkage parameter so that new samples are generated with a smoothed bootstrap (noise added around the duplicated minority samples) instead of being exact duplicates.
from imblearn.over_sampling import RandomOverSampler
ros=RandomOverSampler(shrinkage=0.15)
X_resampled,y_resampled=ros.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])
iii. Oversampling with SMOTE
The imbalanced-learn package comes with a SMOTE class for synthetic oversampling.
from imblearn.over_sampling import SMOTE
smote=SMOTE()
X_resampled,y_resampled=smote.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])
3. Undersampling
i. Simple random undersampling
We use imbalanced-learn, which comes with a RandomUnderSampler class that randomly discards majority samples until the classes are balanced.
from imblearn.under_sampling import RandomUnderSampler
rus=RandomUnderSampler()
X_resampled,y_resampled=rus.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])
ii. Undersampling with K-Means
The imbalanced-learn library contains the ClusterCentroids class, which undersamples the majority class by replacing clusters of majority samples with their KMeans cluster centroids.
from imblearn.under_sampling import ClusterCentroids
cc=ClusterCentroids()
X_resampled,y_resampled=cc.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])
iii. Undersampling using Tomek links
The imbalanced-learn library comes with a TomekLinks class which we use for undersampling. Note that the resulting class counts are not equal. This is because TomekLinks only removes the majority samples that form a Tomek link with a minority sample, i.e. the majority samples that are close to the minority class.
from imblearn.under_sampling import TomekLinks
tl=TomekLinks()
X_resampled,y_resampled=tl.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])
4. Combining Oversampling and Undersampling
To get a more robust way of dealing with imbalanced data we can combine oversampling and undersampling techniques. The imbalanced-learn library comes with a SMOTETomek class, which applies SMOTE oversampling followed by Tomek links undersampling.
from imblearn.combine import SMOTETomek
st=SMOTETomek()
X_resampled,y_resampled=st.fit_resample(X,y)
y_resampled.value_counts().plot(kind='bar',color=['skyblue','orange'])
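To compare any of the above resampling techniques fairly, the resampling should be applied only inside the training folds rather than to the whole dataset before cross-validation. The sketch below is an illustration and not part of the original notebook: it uses imbalanced-learn's Pipeline, which applies the sampler only when fitting, together with cross_val_score and the F1-score (accuracy being misleading on imbalanced data), and it assumes X, y and cv defined in the sections above.
# Sketch only: assumes X, y and cv are defined in the sections above
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
pipeline = Pipeline([('smote', SMOTE()), ('model', LogisticRegression())])  # SMOTE runs on training folds only
f1 = cross_val_score(pipeline, X, y, scoring='f1', cv=cv)
print('Mean F1 %.2f' % (np.mean(f1)*100), '%')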
For complete code check the notebook here.
Conclusion
Imbalanced data can greatly affect a machine learning model in classification problems and results in misleading or poor performance. However, not all machine learning algorithms are equally sensitive to imbalanced data; algorithms such as XGBoost offer built-in ways to compensate for class imbalance. In this post we have looked at various techniques for dealing with imbalanced data, such as resampling, selecting a suitable evaluation metric, and increasing the data size, among others. The above approaches introduce you to the common techniques used for handling imbalanced data. Selecting a suitable approach is based on experimentally evaluating each technique and choosing the one that yields the best performance. In the previous post we learnt how to handle missing values, check the post here. In the next post we will learn how to detect and handle outliers in a dataset.