Feature selection is a feature engineering process (sometimes treated as a data pre-processing step) of choosing the variables with the highest predictive power from a dataset. It can be conducted for both supervised and unsupervised learning. In the supervised approach we statistically evaluate the relationship between the input variables and the output to determine which input features to eliminate. The methods are broadly categorized as filter, wrapper and intrinsic. The key challenge is that the best technique depends on the data types of the input and output variables. In this post we will explore the most common feature selection techniques and how to implement them in Python. Download the dataset for the post here.
Importance of Feature Selection
- Improves model accuracy. Since we keep only the features with strong predictive power and discard the weak ones, model accuracy improves.
- Reduces model training time. The fewer the features, the faster the model trains.
- Reduces the “curse of dimensionality”. Data with many features is hard to analyse and model; feature selection removes the unnecessary features.
- Reduces model overfitting. Overfitting occurs when a model memorizes the training data and performs poorly on test or production data. This often results from the model learning patterns from irrelevant features.
Import Required Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from IPython.display import display_html
import matplotlib.pyplot as plt
import seaborn as sns
# Configure visual display properties
sns.set_theme(style="whitegrid")
sns.set(rc={'figure.figsize':(20,10)})
Load Data
iris_df=pd.read_csv('iris.csv')
iris_df.head()
Scale Dataset
We standardize our dataset to have a mean of 0 and a standard deviation of 1.
features=iris_df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']] # features
y=iris_df['class'] #target
X=StandardScaler().fit_transform(features) # scale the data
X_scaled=pd.DataFrame(X, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width']) # create dataframe of scaled data
scaled_df=pd.concat([X_scaled,y],axis=1) # join scaled dataframe with target variable
# Display the two dataframes side by side
iris_df = iris_df.head().style.set_table_attributes("style='display:inline'").set_caption('Original DataFrame')
scaled_df = scaled_df.head().style.set_table_attributes("style='display:inline'").set_caption('Scaled DataFrame')
display_html(iris_df._repr_html_()+" "+scaled_df._repr_html_(), raw=True)
Feature Selection Techniques
Some algorithms, such as tree-based models, have an in-built feature selection capability that evaluates the importance of each feature in relation to the others. Below are the common feature selection techniques.
1. Univariate Feature Selection. This is a common method for selecting relevant features based on univariate statistical tests. The implementation in scikit-learn provides different ways of setting the selection threshold, such as the number of features or the percentage of features to keep. Sklearn provides 3 scoring options for univariate feature selection based on the input and output data types, as follows:
i. Regression Feature Selection. This is where both the input and output variables are numerical. In the sklearn score_func we use f_regression, which is based on Pearson's correlation coefficient.
ii. Non-negative Feature Selection. This is where the input variables are numeric and non-negative and the output variable is categorical. In the sklearn score_func we use chi2.
iii. Classification Feature Selection. This is where the input variables are numerical and the output variable is categorical. In the sklearn score_func we use f_classif (a short sketch follows the SelectPercentile example below).
i. Univariate Feature Selection using SelectKBest.
We use chi2 as our score function because the input features are non-negative and the output is categorical. For numerical inputs with a categorical output we would use the f_classif scoring function, and for a numerical output we would use f_regression.
from sklearn.feature_selection import SelectKBest,chi2
select_k_best=SelectKBest(chi2,k=3)
select_k_best_fit=select_k_best.fit(features,y)
k_best=pd.DataFrame(select_k_best_fit.scores_,columns=['Score'],index=features.columns).sort_values(by='Score',ascending=False)
k_best
Visualize k best features
sns.barplot(y=k_best.index, x=k_best.Score)
plt.xlabel('Feature Score')
plt.ylabel('Features')
plt.title("Visualizing Best Features")
plt.show()
Select 3 best features
print(k_best['Score'].nlargest(3))
petal_length 116.169847
petal_width 67.244828
sepal_length 10.817821
Name: Score, dtype: float64
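Beyond ranking the scores, the fitted selector can also reduce the dataset to the chosen columns. A minimal sketch (get_feature_names_out assumes scikit-learn 1.0 or later):
X_k_best=select_k_best_fit.transform(features) # keep only the k selected columns
print(select_k_best_fit.get_feature_names_out()) # names of the selected features (requires scikit-learn >= 1.0)
print(X_k_best[:5]) # first five rows of the reduced feature matrix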
ii. Univariate Feature Selection using SelectPercentile
As with SelectKBest, we use chi2 as the score function because the input features are non-negative and the output is categorical. SelectPercentile keeps the highest-scoring percentage of features rather than a fixed number k.
from sklearn.feature_selection import SelectPercentile, chi2
select_p_best=SelectPercentile(chi2,percentile=10)
select_p_best_fit=select_p_best.fit(features,y)
p_best=pd.DataFrame(select_p_best_fit.scores_,columns=['Score'],index=features.columns).sort_values(by='Score',ascending=False)
p_best
Select best 3 features
print(p_best['Score'].nlargest(3))
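Note that chi2 requires non-negative inputs, so it cannot be applied to the standardized data, which contains negative values. Below is a minimal sketch of the classification option (f_classif) described above, applied to the scaled features; the variable names are illustrative.
from sklearn.feature_selection import SelectKBest, f_classif
# f_classif (ANOVA F-test) handles negative inputs with a categorical target
select_f_best=SelectKBest(f_classif,k=3)
select_f_best.fit(X_scaled,y)
f_best=pd.DataFrame(select_f_best.scores_,columns=['Score'],index=features.columns).sort_values(by='Score',ascending=False)
f_best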
2. Feature Importance for Tree-Based Models. Tree-based algorithms such as Random Forest can be used for feature selection. Random Forest exposes a feature importance property, and feature importance can be ranked using the following techniques:
i. Built-in (impurity-based) feature importance.
from sklearn.ensemble import RandomForestClassifier
rf_clf=RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_scaled,y) # train the model
feature_imp = pd.DataFrame(rf_clf.feature_importances_,columns=['Score'],index=features.columns).sort_values(by='Score',ascending=False)
feature_imp
Visualize Feature importance
sns.barplot(y=feature_imp.index, x=feature_imp['Score'])
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
Select 3 best features
print(feature_imp['Score'].nlargest(3))
ii. Permutation-based importance (a short sketch follows after this list), and
iii. SHAP-value-based importance (computed with the external shap package).
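The built-in impurity-based importance is shown above. Below is a minimal sketch of permutation-based importance using scikit-learn's permutation_importance; the n_repeats and random_state values are illustrative, and in practice the importance is usually computed on a held-out test set.
from sklearn.inspection import permutation_importance
# Shuffle each feature in turn and measure the drop in model score;
# larger drops indicate more important features (n_repeats and random_state are illustrative)
perm_result=permutation_importance(rf_clf, X_scaled, y, n_repeats=10, random_state=42)
perm_imp=pd.DataFrame(perm_result.importances_mean,columns=['Score'],index=features.columns).sort_values(by='Score',ascending=False)
perm_imp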
3. L1 Regularization. In machine learning, regularization is used to reduce overfitting of regression models. It is a technique that penalizes large coefficients, shrinking them towards zero or as close to zero as possible. The three common regularization techniques for regression models are LASSO (Least Absolute Shrinkage and Selection Operator) regression, also known as the L1 norm, Ridge regression and Elastic Net regression. LASSO adds the absolute values of the coefficients as the penalty term and can shrink some coefficients exactly to zero. Ridge regression, on the other hand, adds the squared values of the coefficients as the penalty term and therefore shrinks coefficients without setting them to zero. Elastic Net regression is a combination of LASSO and Ridge regression. LASSO's property of driving some coefficients exactly to zero is what makes it useful for feature selection.
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel
model=LogisticRegression(C=1, solver='liblinear', penalty='l1') # L1-penalized logistic regression
select_model=SelectFromModel(model, threshold=1.0) # keep features whose aggregated coefficient magnitude meets the 1.0 threshold
select_model.fit(X_scaled,y)
Inspect the rounded coefficients (L1 regularization drives weak features to 0)
np.round(select_model.estimator_.coef_,1)
Get the feature support mask
select_model.get_support()
array([False, True, True, True])
Get remaining features
X_scaled.columns[(select_model.get_support())]
Index(['sepal_width', 'petal_length', 'petal_width'], dtype='object')
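The same SelectFromModel pattern works with Lasso when the target is numerical. Below is a minimal, purely illustrative sketch (the iris target is categorical, so here we predict petal_width from the remaining scaled features; the alpha value is an arbitrary choice):
from sklearn.linear_model import Lasso
# Illustrative regression setup: petal_width as a numeric target, the other scaled features as inputs
X_reg=X_scaled.drop('petal_width',axis=1)
y_reg=X_scaled['petal_width']
lasso_selector=SelectFromModel(Lasso(alpha=0.1)) # the default threshold keeps features with non-zero coefficients
lasso_selector.fit(X_reg,y_reg)
print(np.round(lasso_selector.estimator_.coef_,2)) # coefficients driven to 0 are dropped
print(X_reg.columns[lasso_selector.get_support()]) # remaining features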
4. Low Variance Filter. In this technique we remove variables that have low or constant variance. We first normalize our data and set a variance threshold, then compute the variance of each variable and drop those whose variance falls below the threshold. Low variance features add little to the model's predictions, hence we discard them.
- Low Variance filter using sklearn VarianceThreshold
from sklearn.feature_selection import VarianceThreshold
threshold_value=0.5
vt=VarianceThreshold(threshold=threshold_value)
vt_fit=vt.fit_transform(features)
Get variance for each feature
print(vt.variances_) # actual variance values
[0.68112222 0.18675067 3.09242489 0.57853156]
i. Rank features by variance
feature_variance=pd.DataFrame(vt.variances_,columns=['Score'],index=features.columns).sort_values(by='Score',ascending=False)
feature_variance
Visualize Feature Variance
sns.barplot(y=feature_variance.index, x=feature_variance['Score'])
plt.xlabel('Feature Variance')
plt.ylabel('Features')
plt.title("Visualizing Feature Variance")
plt.show()
Select best 3 features
print(feature_variance['Score'].nlargest(3))
ii. Get columns with low variance, i.e. variance below the 0.5 threshold
print(vt.variances_) # actual variance values
print(vt.get_support()) # True means the feature's variance is above the threshold, False means it is below
low_var_=[column for column in features.columns if column not in features.columns[vt.get_support()]]
low_var_
remove_low_vars_=features.drop(low_var_,axis=1)
# Drop features with variance below the 0.5 threshold
iris_df = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
low_vars_df_ = remove_low_vars_.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Low Variance Features')
display_html(iris_df._repr_html_()+" "+low_vars_df_._repr_html_(), raw=True)
iii. Get Feature Variance from DataFrame
variance=features.var()
variance
Low Variance Feature
low_var_features = []
for i in range(len(variance)):
    if variance.iloc[i] <= 0.50:
        low_var_features.append(features.columns[i])
low_var_features
['sepal_width']
Remove low variance feature
remove_low_var_features=features.drop(low_var_features,axis=1)
remove_low_var_features.head()
# Drop features with variance below the 0.5 threshold
iris_df = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
low_var_features_df = remove_low_var_features.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Low Variance Features')
display_html(iris_df._repr_html_()+" "+low_var_features_df._repr_html_(), raw=True)
5. High Correlation Filter. When two features are highly correlated we remove one of them, since keeping both leads to multicollinearity. Multicollinearity makes a machine learning model harder to interpret and adds redundant complexity from the correlated features.
Get the correlation matrix first
cor_matrix=features.corr()
cor_matrix
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
plt.rcParams['text.color'] = 'blue'
plt.rcParams['font.size'] = 20
sns.heatmap(cor_matrix, annot=True, fmt=".4f", linewidths=.5,cmap='winter')
plt.show()
View the unique correlation pairs (the matrix is symmetric, so mirrored pairs are dropped)
cor_matrix.unstack().sort_values().drop_duplicates() # drop the duplicate mirrored pairs
Remove correlated features based on defined correlation threshold
# Create an upper triangle matrix with np.triu
upper_triangle_matrix = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
upper_triangle_matrix
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
plt.rcParams['font.size'] = 20
sns.heatmap(upper_triangle_matrix, annot=True, fmt=".4f", linewidths=.5,cmap='Greens')
plt.show()
Get the highly correlated features (correlation above 0.90)
correlated_features=[column for column in upper_triangle_matrix.columns if any(upper_triangle_matrix[column] > 0.90)]
correlated_features
['petal_width']
remove_correlated_features=features.drop(correlated_features,axis=1)
# Display the two dataframes side by side
iris_df = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
uncorrelated_df = remove_correlated_features.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Correlated Features')
display_html(iris_df._repr_html_()+" "+uncorrelated_df._repr_html_(), raw=True)
For the complete code, check the feature engineering notebook here.
Conclusion
Feature selection is an important step in machine learning model development. It allows us to select the most relevant features, i.e. those with the highest predictive power. The benefits of feature selection include improved model performance (in terms of accuracy, precision, etc.), reduced data dimensionality, which makes the data easier to analyse and model, and faster training, since fewer features require fewer compute resources. We have seen several feature selection techniques. In the next post we will look at Feature Encoding for categorical variables. To learn about Dimensionality Reduction check our previous post here.