Feature scaling is the process of normalizing or standardizing data into a range suitable for fitting a machine learning algorithm. If we don't scale our data, then for a variable such as age, with values ranging from 12 to 98, some algorithms will give more weight to 98 and less to 12, which is misleading; hence the need to scale the variable to a fixed range. Whether feature scaling helps depends on the algorithm: gradient-based algorithms train faster when features are scaled, and scaling also reduces the chances of the model getting stuck in local optima. For distance-based algorithms, feature scaling prevents features with large ranges from dominating the distance computation. Whether or not to scale our data therefore largely depends on the kind of algorithm in use. Download the data for this post here.
Feature Scaling Techniques
Now that we have decided to scale our data, there are various techniques to choose from depending on the nature of the data. Sometimes deciding which technique to use is difficult and requires an empirical approach: try each one and evaluate which performs better. One key item to note while performing feature scaling is avoiding data leakage. Data leakage can occur when we perform feature scaling on all the data before splitting it into training and validation sets. We have to split the data into training and validation sets first, fit the scaler on the training set only, and then use that fitted scaler to transform the validation set.
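A minimal sketch of this split-then-scale workflow, assuming a hypothetical feature matrix X and target y (the placeholders below are for illustration only):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
X = np.random.rand(100, 2)        # placeholder features, for illustration only
y = np.random.randint(0, 2, 100)  # placeholder binary target
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on the training set only
X_val_scaled = scaler.transform(X_val)          # transform validation data with training-set statistics
With that caveat in mind, below are common feature scaling techniques.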
- Min-max normalization. Also known as min-max scaling, this technique shifts the values into a range between 0 and 1. It follows the formula X' = (x - x_min) / (x_max - x_min), which outputs new values in the range 0 to 1. Min-max normalization has proved to work well with data that does not follow a Gaussian distribution. However, unlike standardization, this approach does not reduce the influence of outliers in the data.
- Standardization. Also referred to as Z-score normalization, this is a technique that transforms data to have a mean of 0 and a standard deviation of 1. It follows the equation:
X' = (x - x_mean) / standard_deviation
It works well when the data follows a Gaussian distribution, and it handles outliers better than min-max normalization because the transformed values are not bounded to a fixed range.
- Mean normalization. This is one of the least used techniques in feature scaling. It works by subtracting the mean from each value in the feature and dividing by the range. It follows the equation:
X' = (x - x_mean) / (x_max - x_min)
- Unit length scaling. In this approach we scale each feature vector so that it ends up with a length (Euclidean norm) of 1. It follows the formula x' = x / ||x||.
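To make these formulas concrete, here is a small sketch (using made-up toy values) that applies each technique to a single feature with numpy:
import numpy as np
x = np.array([12.0, 40.0, 98.0])  # made-up age values
min_max = (x - x.min()) / (x.max() - x.min())     # min-max normalization: 0.0, ~0.326, 1.0
z_score = (x - x.mean()) / x.std()                # standardization: mean 0, std 1
mean_norm = (x - x.mean()) / (x.max() - x.min())  # mean normalization
unit_length = x / np.linalg.norm(x)               # unit length scaling: resulting norm is 1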
Feature Scaling in Python
In this section we will use the Titanic dataset and scale the Age and Fare features using the different techniques.
Load required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, StandardScaler
Load data
titanic_df=pd.read_csv('titanic.csv')
titanic_df.head()
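Before scaling, it is worth checking the two features for missing values; the Titanic Age column typically contains some. Recent versions of scikit-learn's scalers skip NaNs when computing their statistics, but it is still good to know they are there:
titanic_df[['Age','Fare']].isna().sum()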
Min-max normalization
features_to_scale=titanic_df[['Age','Fare']]
min_max_scaler=MinMaxScaler(feature_range=(0,1)) # you can adjust the range to suit the data
min_max_normalized=min_max_scaler.fit_transform(features_to_scale) # learns min/max, then transforms
min_max_normalized_df=pd.DataFrame(min_max_normalized,columns=['Normalized_Age','Normalized_Fare'])
scaled_df=pd.concat([features_to_scale,min_max_normalized_df],axis=1) # keep originals alongside scaled values
scaled_df.head()
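As a quick sanity check, the normalized columns should now span exactly 0 to 1:
min_max_normalized_df.describe().loc[['min','max']]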
Standardization (Z-score normalization)
standard_scaler=StandardScaler()
standardized=standard_scaler.fit_transform(standard_scaler_input:=features_to_scale) # learns mean/std, then transforms
standardized_df=pd.DataFrame(standardized, columns=['Standardized_Age','Standardized_Fare'])
scaled_df=pd.concat([scaled_df,standardized_df], axis=1)
scaled_df.head()
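Similarly, the standardized columns should have a mean of roughly 0 and a standard deviation of roughly 1 (not exactly 1 here, since pandas computes the sample standard deviation while StandardScaler uses the population version):
scaled_df[['Standardized_Age','Standardized_Fare']].agg(['mean','std'])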
Let's check the distribution of Age before and after scaling.
Before Scaling
sns.set(font_scale = 2) # set the seaborn theme first, since it resets matplotlib rcParams
plt.rcParams['axes.labelsize'] = 20
plt.rcParams['text.color'] = 'blue'
plt.rcParams['font.size'] = 20
sns.kdeplot(data=scaled_df, x="Age", fill=True, alpha=.5, linewidth=0)
plt.show()
After Scaling
# the theme and rcParams set above still apply
sns.kdeplot(data=scaled_df, x="Standardized_Age", fill=True, alpha=.5, linewidth=0)
sns.kdeplot(data=scaled_df, x="Normalized_Age", fill=True, alpha=.5, linewidth=0)
plt.legend(labels=["Standardized","Normalized"])
plt.show()
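The example above implements the first two techniques. For completeness, here is a sketch of the remaining two: mean normalization computed by hand with pandas (scikit-learn has no dedicated scaler for it), and unit length scaling with scikit-learn's Normalizer, which rescales each row to unit Euclidean norm. Note that Normalizer does not accept missing values, so we drop the rows with NaNs first:
from sklearn.preprocessing import Normalizer
# Mean normalization: (x - mean) / (max - min), column-wise; pandas skips NaNs automatically
mean_normalized_df=(features_to_scale-features_to_scale.mean())/(features_to_scale.max()-features_to_scale.min())
# Unit length scaling: each row (sample) rescaled so its Euclidean norm is 1
features_no_nan=features_to_scale.dropna() # Normalizer cannot handle missing values
unit_length=Normalizer(norm='l2').fit_transform(features_no_nan)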
For the complete code, check the notebook here.
Conclusion
Deciding whether or not to scale a dataset, and which technique to use, is usually not straightforward and depends on various factors such as the algorithm being used. It is important to try different data scaling techniques and assess the performance of the model. In this post we have looked at what feature scaling is, its benefits, and when to use it. We have also looked at the most common techniques for scaling features and how to implement them in Python. In the next post we will look at Detecting Multicollinearity in data. To learn about Feature Encoding, check our previous post here.