One key challenge with most machine learning algorithms is their inability to work with categorical variables: they need features to be in numeric form. The process of transforming categorical features into numerical form is referred to as feature encoding. It is a key step in the machine learning modelling phase and, done well, can improve model performance. In this post we will look at different feature encoding techniques and their pros and cons. Download the dataset for this post here.
Types of Categorical Variables
- Binary. True or False, Yes or No etc.
- Ordinal. These are ordered variables such as temperature (cold, warm, hot, very hot).
- Nominal. These are variables without any specific order such as food (bread, rice, pizza).
Feature Encoding Techniques
- Label Encoding. This is a common feature encoding technique used for ordinal categorical data. It assigns each label a unique number based on alphabetical ordering. It should be used for ordinal data, where the order of the values matters and should be preserved. It is not ideal for nominal data, where the order of items is immaterial; for nominal data we use other techniques such as one-hot encoding.
- One-Hot Encoding. One-hot encoding is another commonly used technique, typically for nominal data. For a given feature it creates one additional column for each unique value in that feature, then places a 1 in the column matching the value in each row and 0 in the rest. If you have a feature, let’s call it fruits, with 3 unique values (banana, orange, and apple), one-hot encoding will create three columns, one per fruit; the column for the fruit present in a row gets 1 and the others get 0. The challenge with this technique is that it results in a large number of columns when a feature has many categorical values. This feature explosion causes the curse-of-dimensionality problem.
- Frequency Encoding. Also sometimes referred to as count encoding. We count the number of observations for each category and replace each value with its number of occurrences. For example, if we have a feature called fruits with three types (banana, orange, apple), we count how many times banana occurs in the data; if it occurs 10 times we replace banana with 10, and so on for the other types. The counts then serve as the labels. We can also use percentages instead of counts, where we divide the count of each fruit by the total count of all fruits in the feature. This approach has various benefits:
i. It does not lead to an explosion of features, unlike one-hot encoding.
ii. It can lead to much better performance if the target variable is somewhat correlated with the frequency.
- Binary Encoding. This technique first converts each categorical value to an integer with order preserved (ordinal encoding) and then converts that integer to its binary equivalent. The binary digits are then split into their own columns.
- Hash Encoding. Hash encoding uses a hash function to map each categorical value to a number. It is similar to one-hot encoding but more flexible, as it allows us to choose the number of output dimensions. The advantage of hash encoding is the lower number of dimensions. However, it can lead to hash collisions, where different categories map to the same value, when a feature has many categories relative to the chosen number of dimensions.
- Mean / Target Encoding. Mean target encoding works by computing the average of the target variable for each value of the categorical variable and replacing the categorical value with that mean. For example, if we intend to predict the price of fruits as our target variable and have a categorical column such as fruit type with values (apple, orange, etc.), we replace each fruit type with the mean of its target values. This approach is useful when a feature has a large number of categorical values. However, when not done properly it can lead to target leakage. It can also produce misleading encodings on imbalanced data, and since it depends heavily on the mean it can be affected by outliers in the dataset.
- Weight of Evidence Encoding. In WoE encoding we replace each category with the natural log of [P(1) / P(0)], where P(1) is the probability of one class and P(0) the probability of the other class when the target variable takes values 1 and 0. This approach is ideal when using a Logistic Regression model.
Feature Encoding Python Implementation
Import required Libraries
import pandas as pd
import numpy as np
fruits_df = pd.DataFrame({'Fruit_Type': ['Banana', 'Orange', 'Apple', 'Mango', 'Orange'],
                          'Weight': [50.0, 55.0, 53.0, 62.0, 56.12],
                          'Price': [0.5, 0.6, 0.65, 0.55, 0.62]})
fruits_df
- Label Encoding
Label encoding converts categorical values to numbers by ordering them alphabetically.
Approach 1: Using sklearn
from sklearn.preprocessing import LabelEncoder
le_df = fruits_df.copy()  # work on a copy so the original dataframe is not modified
le = LabelEncoder()
le_df['label_encoded_fruit_type'] = le.fit_transform(le_df['Fruit_Type'])
le_df
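As a small optional check (not part of the original walkthrough), sklearn's LabelEncoder keeps the fitted mapping, so the original labels can be recovered with inverse_transform:
le.inverse_transform(le_df['label_encoded_fruit_type'])  # returns the original fruit names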
Approach 2: Using the category_encoders library
This is a Python library that contains various feature encoding functions. To install it, open a terminal and run the command below:
pip install category_encoders
import category_encoders as ce
le_ce_df = fruits_df.copy()
le_ce = ce.OrdinalEncoder()
le_ce_df['ordinal_encoded_fruit_type'] = le_ce.fit_transform(le_ce_df['Fruit_Type'])['Fruit_Type']  # OrdinalEncoder returns a dataframe, so extract the encoded column
le_ce_df
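Because label encoding orders labels alphabetically, the resulting numbers may not match the real ordering of an ordinal feature. A minimal sketch of enforcing a custom order with a plain pandas map, using a hypothetical Temperature feature that is not part of the fruits dataset:
temp_df = pd.DataFrame({'Temperature': ['cold', 'hot', 'warm', 'very hot', 'cold']})
temp_order = {'cold': 0, 'warm': 1, 'hot': 2, 'very hot': 3}  # domain-defined order, not alphabetical
temp_df['Temperature_encoded'] = temp_df['Temperature'].map(temp_order)
temp_df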
- One-Hot Encoding
Approach 1: Using category_encoders
ohe_ce = ce.OneHotEncoder(cols=['Fruit_Type'], handle_unknown='return_nan', return_df=True, use_cat_names=True)
ohe_fruits_df = ohe_ce.fit_transform(fruits_df['Fruit_Type'])
ohe_fruits_df = pd.concat([fruits_df[['Weight', 'Price']], ohe_fruits_df], axis=1)  # join the encoded columns with the original numeric columns
ohe_fruits_df
Approach 2: Using pandas get_dummies()
dummy_encode = pd.get_dummies(data=fruits_df['Fruit_Type'], drop_first=True)  # drop_first=True drops the first category's column to avoid a redundant column
dummy_encode = pd.concat([fruits_df[['Weight', 'Price']], dummy_encode], axis=1)  # join the encoded columns with the original numeric columns
dummy_encode
- Frequency Encoding
freq_encoding_df = fruits_df.copy()
fruit_type_count = freq_encoding_df['Fruit_Type'].value_counts().to_dict()  # count of each unique fruit type
freq_encoding_df['Fruit_Type'] = freq_encoding_df['Fruit_Type'].map(fruit_type_count)  # replace each fruit type with its count
print(fruit_type_count)
freq_encoding_df
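As mentioned above, we can use percentages instead of raw counts. A minimal sketch using value_counts with normalize=True on a fresh copy of the fruits dataframe:
freq_pct_df = fruits_df.copy()
fruit_type_pct = freq_pct_df['Fruit_Type'].value_counts(normalize=True).to_dict()  # fraction of all rows for each fruit type
freq_pct_df['Fruit_Type'] = freq_pct_df['Fruit_Type'].map(fruit_type_pct)
freq_pct_df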
- Binary Encoding
binary_encoded_df = fruits_df.copy()
binary_en=ce.BinaryEncoder(cols=['Fruit_Type'],return_df=True)
binary_encoded_df=binary_en.fit_transform(binary_encoded_df)
binary_encoded_df
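To see the idea behind BinaryEncoder, here is a rough manual sketch: each fruit type is first given an integer code and that integer is then written out in binary, one digit per column. This is only an illustration; the exact integer assignment BinaryEncoder uses internally may differ.
fruit_codes = {fruit: idx + 1 for idx, fruit in enumerate(fruits_df['Fruit_Type'].unique())}  # ordinal step
bits = fruits_df['Fruit_Type'].map(fruit_codes).apply(lambda code: list(format(code, '03b')))  # 3-digit binary string per row
pd.DataFrame(bits.tolist(), columns=['bit_2', 'bit_1', 'bit_0'], index=fruits_df.index).astype(int)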
- Hash Encoding
hashing_encoded_df = fruits_df.copy()
h_encoded=ce.HashingEncoder(n_components=3, cols=['Fruit_Type'])
hashing_encoded_df=h_encoded.fit_transform(hashing_encoded_df)
hashing_encoded_df
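The general idea behind hash encoding can be sketched with Python's hashlib: hash the category string and take the remainder modulo the number of output dimensions. This is a simplified illustration, not necessarily byte-for-byte what HashingEncoder does internally.
import hashlib

def hash_bucket(value, n_components=3):
    # Map a category to one of n_components buckets (illustrative only)
    return int(hashlib.md5(str(value).encode()).hexdigest(), 16) % n_components

fruits_df['Fruit_Type'].apply(hash_bucket)
Note that with four unique fruit types and only three buckets, at least two fruit types are guaranteed to collide, which is the collision risk mentioned earlier.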
- Mean Target Encoding
mean_target_df = fruits_df.copy()
mean_target = ce.TargetEncoder(cols=['Fruit_Type'])
mean_target_df['Mean_Target_Fruit_Type'] = mean_target.fit_transform(mean_target_df['Fruit_Type'], mean_target_df['Price'])['Fruit_Type']  # TargetEncoder returns a dataframe, so extract the encoded column
mean_target_df
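As a sanity check, the raw per-category mean of the target can be computed directly with a groupby; TargetEncoder additionally blends each category mean with the overall mean (its smoothing behaviour), so its output will usually differ slightly from these raw means.
fruits_df.groupby('Fruit_Type')['Price'].mean()  # raw mean price per fruit type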
- Weight of Evidence Encoding
Load titanic data
titanic_df=pd.read_csv('titanic.csv')
titanic_df.head()
Approach 1: Using WOEEncoder
WOEEncoder from category_encoders provides different parameters that we can adjust to get the desired results. For full details on how to use it, check the category_encoders documentation here.
woe_df = titanic_df.copy()
woe = ce.WOEEncoder(cols=['Sex'])
woe_df['WoE_Sex'] = woe.fit_transform(woe_df['Sex'], woe_df['Survived'])['Sex']  # WOEEncoder returns a dataframe, so extract the encoded column
woe_df.head()
Approach 2: Calculating it manually
prob_df = pd.DataFrame(woe_df.groupby(['Sex'])['Survived'].mean())  # probability of surviving for each sex
prob_df['died'] = 1 - prob_df['Survived']  # probability of dying for each sex
prob_df['woe'] = np.log(prob_df['Survived'] / prob_df['died'])  # weight of evidence
print(prob_df['woe'].to_dict())
prob_df
Weight of Evidence Calculation
woe_df['WoE_Gender']=woe_df['Sex'].map(prob_df['woe'].to_dict())
woe_df.head()
Note that we get slightly different results from the category_encoders WOEEncoder because it has parameters we can tune to change the output; the results above are based on the default parameter values. Try adjusting parameters such as sigma and regularization and observe the results.
For complete code check the feature-engineering notebook here.
Conclusion
Feature encoding is an important part of preparing data for modelling. Most machine learning algorithms don’t have an internal mechanism for handling categorical data and rely on externally encoded features. In this post we have looked at the different types of categorical data, namely binary, ordinal and nominal. We have also looked at different techniques for encoding categorical data, such as label encoding, one-hot encoding, frequency encoding, binary encoding, hash encoding, mean target encoding and the Weight of Evidence approach. In the next post we will look at Feature Scaling and its various techniques. To learn about Feature Selection check our previous post here.