Feature selection is a feature engineering process (sometimes considered a data pre-processing step) of choosing the variables with the highest predictive power from a dataset. It can be conducted for both supervised and unsupervised learning. In the supervised approach we statistically evaluate the relationship between the input variables and the output and determine which input features to eliminate. The methods are broadly categorized as filter, wrapper and intrinsic. The key challenge in selecting the best features depends on the data types of the input and output variables. In this post we will explore the most common feature selection techniques and how to implement them in Python. Download the dataset for the post here.


Importance of Feature Selection

  1. Improves model accuracy. Since we select only the features with the highest predictive power and discard weak features, model accuracy improves.
  2. Reduces model training time. The fewer the features, the faster the model trains.
  3. Reduces the “curse of dimensionality”. Data with many features is hard to analyse and model; feature selection removes unnecessary features.
  4. Reduces model overfitting. Overfitting occurs when the model memorizes the training data and performs poorly on test data or data in production, often as a result of learning patterns from irrelevant features.

Import Required Libraries

                    

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from IPython.display import display_html 
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visual display properties
sns.set_theme(style="whitegrid")
sns.set(rc={'figure.figsize':(20,10)})

Load Data

                    

iris_df=pd.read_csv('iris.csv')
iris_df.head()

seaborn-linear-regression-iris-data

Scale Dataset

We standardize our dataset to have a mean of 0 and a standard deviation of 1.

                    

features=iris_df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']] # features
y=iris_df['class'] #target

X=StandardScaler().fit_transform(features) # scale the data
X_scaled=pd.DataFrame(X, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width']) # create dataframe of scaled data
scaled_df=pd.concat([X_scaled,y],axis=1) # join scaled dataframe with target variable

# Display the two dataframes side by side
iris_df = iris_df.head().style.set_table_attributes("style='display:inline'").set_caption('Original DataFrame')
scaled_df = scaled_df.head().style.set_table_attributes("style='display:inline'").set_caption('Scaled DataFrame')

display_html(iris_df._repr_html_()+"     "+scaled_df._repr_html_(), raw=True)

dimesnionality-reduction-standardize-data
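As a quick sanity check (not in the original post), the scaled columns should now have a mean of roughly 0 and a standard deviation of roughly 1:

print(X_scaled.mean().round(2)) # means ≈ 0
print(X_scaled.std().round(2))  # sample standard deviations ≈ 1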

Feature Selection Techniques

Some algorithms, such as tree-based models, have an in-built feature selection capability that evaluates the importance of each feature in relation to the others. Below are the common feature selection techniques.

  1. Univariate Feature Selection. This is a common method for selecting relevant features based on univariate statistical tests. The implementation in scikit-learn provides different ways of setting the threshold, such as the number of features or the percentage of features to select. Sklearn provides 3 ways to select univariate features based on the input and output data types, as follows;

                i. Regression feature selection. Both the input and output variables are numerical. In sklearn we use f_regression as the score_func, which is based on Pearson's correlation coefficient.

               ii. Non-negative feature selection. The input variables are numeric and non-negative and the output variable is categorical. In sklearn we use chi2 as the score_func.

              iii. Classification feature selection. The input variables are numerical and the output variable is categorical. In sklearn we use f_classif as the score_func (a minimal sketch follows this list).
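As a minimal sketch (not part of the original post), classification feature selection with f_classif can be run on the standardized features defined above, since f_classif, unlike chi2, accepts negative values:

from sklearn.feature_selection import SelectKBest, f_classif

# ANOVA F-test between each feature and the class labels
select_f_classif = SelectKBest(f_classif, k=3).fit(X_scaled, y)
pd.DataFrame(select_f_classif.scores_, columns=['Score'], index=features.columns).sort_values(by='Score', ascending=False)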

Univariate Feature Selection using SelectKBest
We use chi2 as the score function because the input features are non-negative and the output is categorical. For a categorical output with numerical inputs we would use the f_classif scoring function, and for a numerical output the r_regression (or f_regression) scoring function.

                    

from sklearn.feature_selection import SelectKBest,chi2

select_k_best=SelectKBest(chi2,k=3)
select_k_best_fit=select_k_best.fit(features,y)
k_best=pd.DataFrame(select_k_best_fit.scores_,columns=['Score'],index=features.columns).sort_values(by='Score',ascending=False)
k_best

univariate-feature-selection-chi2
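Beyond the scores, the fitted selector can also return the reduced feature set directly; a short sketch using the fitted object above:

print(features.columns[select_k_best_fit.get_support()]) # names of the 3 selected columns
print(select_k_best_fit.transform(features).shape)       # reduced feature matrix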

Visualize k best features

                    

sns.barplot(y=k_best.index, x=k_best.Score)
plt.xlabel('Feature Score')
plt.ylabel('Features')
plt.title("Visualizing Best Features")
plt.show()

univariate-feature-selection-chi2-visualize-kbest-features

Select 3 best features

                    

print(k_best.nlargest(3,'Score'))

                   Score
petal_length  116.169847
petal_width    67.244828
sepal_length   10.817821

Univariate Feature Selection using SelectPercentile
As before, we use chi2 as the score function because the input features are non-negative and the output is categorical; SelectPercentile keeps the top percentage of features rather than a fixed number.

                    

from sklearn.feature_selection import SelectPercentile, chi2

select_p_best=SelectPercentile(chi2,percentile=10)
select_p_best_fit=select_p_best.fit(features,y)
p_best=pd.DataFrame(select_p_best_fit.scores_,columns=['Score'],index=features.columns).sort_values(by='Score',ascending=False)
p_best

univariate-feature-selection-chi2-select-kbestkby-percentile

Select best 3 features

                    

print(p_best['Score'].nlargest(3))

univariate-feature-selection-chi2-select-kbestkby-percentile-selected
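Note that percentile=10 keeps only the top 10% of features, which for this 4-column dataset amounts to a single column; a quick check with the fitted selector:

print(features.columns[select_p_best_fit.get_support()]) # column(s) kept at the 10th percentile
print(select_p_best_fit.transform(features).shape)       # reduced feature matrix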


  2. Feature Importance for Tree-Based Models. Tree-based algorithms such as Random Forest can be used for feature selection. Feature importance from a trained Random Forest can be obtained with the following techniques;

i. Built-in feature importance,

                    

from sklearn.ensemble import RandomForestClassifier

rf_clf=RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_scaled,y) # train the model

feature_imp = pd.DataFrame(rf_clf.feature_importances_,columns=['Score'],index=features.columns).sort_values(by='Score',ascending=False)
feature_imp

random-forest-feature-importance-score

Visualize Feature importance

                    

sns.barplot(y=feature_imp.index, x=feature_imp['Score'])
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

random-forest-feature-importance-visualization

Select 3 best features

                    

print(feature_imp['Score'].nlargest(3))

random-forest-feature-importance-select-3-best-features
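To go from the importance scores to an actual reduced dataset, SelectFromModel can wrap the forest; a minimal sketch, assuming the scaled features and target defined above (max_features combined with threshold=-np.inf keeps exactly the top 3 features):

from sklearn.feature_selection import SelectFromModel

# Keep the 3 features with the highest built-in importance
sfm = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=0), max_features=3, threshold=-np.inf)
sfm.fit(X_scaled, y)
X_scaled.columns[sfm.get_support()]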

ii. Permutation-based importance (a minimal sketch follows this list), and

iii. SHAP values based importance.
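Permutation-based importance is available in scikit-learn; below is a minimal sketch (not from the original post) that reuses the rf_clf model trained above. SHAP-based importance requires the third-party shap library (e.g. its TreeExplainer) and is not shown here.

from sklearn.inspection import permutation_importance

# Shuffle each feature and measure the drop in model score, averaged over repeats
perm = permutation_importance(rf_clf, X_scaled, y, n_repeats=10, random_state=42)
pd.DataFrame(perm.importances_mean, columns=['Score'], index=features.columns).sort_values(by='Score', ascending=False)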

  3. L1 Regularization. In machine learning, regularization is used to reduce overfitting of regression models by penalizing large coefficients. The three common regularization techniques for regression models are LASSO (Least Absolute Shrinkage and Selection Operator) regression, also known as the L1 norm, Ridge regression, and Elastic Net regression. LASSO adds the absolute values of the coefficients as the penalty term, which can shrink some coefficients exactly to zero. Ridge regression adds the squared values of the coefficients as the penalty term and shrinks coefficients without setting them to zero, while Elastic Net combines the LASSO and Ridge penalties. The property of LASSO regression of driving coefficients to exactly zero makes it useful for feature selection.
                    

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

model=LogisticRegression(C=1,solver='liblinear', penalty='l1') # L1-penalized logistic regression
select_model=SelectFromModel(model,threshold=1.0) # keep features whose coefficient magnitude reaches the 1.0 threshold
select_model.fit(X_scaled,y)

l1-regularization-select-from-model

Inspect the coefficients (features whose coefficients are shrunk to 0 get dropped)

                    

np.round(select_model.estimator_.coef_,1)

l1-regularization-estimator-coef

Get features support

                    

select_model.get_support()

array([False, True, True, True])

Get remaining features

                    

X_scaled.columns[(select_model.get_support())]

Index(['sepal_width', 'petal_length', 'petal_width'], dtype='object')
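The claim above, that LASSO drives some coefficients exactly to zero while Ridge only shrinks them, can be checked on a small synthetic regression problem (illustrative only, not from the original post):

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 10 features, only 3 of which actually drive the target
X_reg, y_reg = make_regression(n_samples=200, n_features=10, n_informative=3, noise=5, random_state=0)

print(np.round(Lasso(alpha=1.0).fit(X_reg, y_reg).coef_, 2)) # uninformative features typically end up exactly 0
print(np.round(Ridge(alpha=1.0).fit(X_reg, y_reg).coef_, 2)) # small but non-zero coefficients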

4. Low Variance Filter. In this technique we remove variables that have low or constant variance. We set a variance threshold (normalizing the data first if the features are on very different scales), compute the variance of each variable, and remove the variables whose variance falls below the threshold. Low variance features don't add much value to model prediction, hence we discard them.

  i. Low Variance Filter using sklearn VarianceThreshold
                    

from sklearn.feature_selection import VarianceThreshold

threshold_value=0.5
vt=VarianceThreshold(threshold=threshold_value)   
vt_fit=vt.fit_transform(features)

Get variance for each feature

                    

print(vt.variances_) # actual variance values

[0.68112222 0.18675067 3.09242489 0.57853156]
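Since fit_transform was used above, vt_fit already holds the reduced array with the below-threshold column dropped:

print(vt_fit.shape) # (rows, columns remaining after the 0.5 variance filter)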

Rank features by variance

                    

feature_variance=pd.DataFrame(vt.variances_,columns=['Score'],index=features.columns).sort_values(by='Score',ascending=False)
feature_variance

low-variance-filter-features-score

Visualize Feature Variance

                    

sns.barplot(y=feature_variance.index, x=feature_variance['Score'])
plt.xlabel('Feature Variance')
plt.ylabel('Features')
plt.title("Visualizing Feature Variance")
plt.show()

low-variance-filter-features-score-visualization

Select best 3 features

                    

print(feature_variance['Score'].nlargest(3))

low-variance-filter-select-3-best-features-score

Get columns with variance below the 0.5 threshold

                    

print(vt.variances_) # actual variance values
print(vt.get_support()) # True: variance above the threshold, False: variance below it

low_var_=[column for column in features.columns if column not in features.columns[vt.get_support()]]
low_var_

low-variance-filter-features-score-by-50-threshold

                    

remove_low_vars_=features.drop(low_var_,axis=1) # drop features below the 0.5 variance threshold

# Display the two dataframes side by side
iris_df = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
low_vars_df_ = remove_low_vars_.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Low Variance Features')

display_html(iris_df._repr_html_()+"     "+low_vars_df_._repr_html_(), raw=True)

low-variance-filter-features-score-threshold-summary

ii. Get Feature Variance from a pandas DataFrame

                    

variance=features.var()
variance

low-variance-filter-features-score-df-var

Low Variance Feature

                    

low_var_features = [col for col in features.columns if variance[col] <= 0.50] # columns below the 0.5 threshold
low_var_features

['sepal_width']

Remove low variance feature

                    

remove_low_var_features=features.drop(low_var_features,axis=1)
remove_low_var_features.head()

low-variance-filter-remove-low-variance-features

                    

# Display the two dataframes side by side
iris_df = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
low_var_features_df = remove_low_var_features.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Low Variance Features')

display_html(iris_df._repr_html_()+"     "+low_var_features_df._repr_html_(), raw=True)

low-variance-filter-remove-low-variance-features-summary

5. High Correlation Filter. When two features are highly correlated we can remove one of them, since keeping both leads to multicollinearity. Multicollinearity affects the interpretation of a machine learning model and adds complexity through redundant features.

Get the correlation matrix first

                    

cor_matrix=features.corr()
cor_matrix

dimesnionality-reduction-high-variance-filter-correlation-matrix

                    

plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
plt.rcParams['text.color'] = 'blue'
plt.rcParams['font.size'] = 20
sns.heatmap(cor_matrix, annot=True, fmt=".4f", linewidths=.5,cmap='winter')
plt.show()

dimesnionality-reduction-high-variance-visualize-correlation-matrix

View the correlation pairs and drop duplicate pairs

dimesnionality-reduction-high-variance-filter-unstack-correlation-matrix

cor_matrix.unstack().sort_values().drop_duplicates() # drop duplicate correlation pairs

dimesnionality-reduction-high-variance-filter-unstack-correlation-matrix-remove-duplicates
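From the unstacked pairs, the highly correlated pairs can be pulled out directly; a small sketch, using a 0.90 cut-off and excluding the self-correlations of 1.0:

corr_pairs = cor_matrix.unstack().sort_values().drop_duplicates()
corr_pairs[(corr_pairs > 0.90) & (corr_pairs < 1.0)] # feature pairs with correlation above 0.90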

Remove correlated features based on defined correlation threshold

                    

# Create an upper triangle matrix with np.triu
upper_triangle_matrix = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
upper_triangle_matrix

dimesnionality-reduction-high-variance-filter-correlation-threshold

                    

plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
plt.rcParams['font.size'] = 20
sns.heatmap(upper_triangle_matrix, annot=True, fmt=".4f", linewidths=.5,cmap='Greens')
plt.show()

dimesnionality-reduction-high-variance-filter-visulaize-upper-triangle-matrix

Highly correlated features

                    

correlated_features=[column for column in upper_triangle_matrix.columns if any(upper_triangle_matrix[column] > 0.90)]
correlated_features

['petal_width']

                    

remove_correlated_features=features.drop(correlated_features,axis=1)
# Display the two dataframes side by side
iris_df = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
uncorrelated_df = remove_correlated_features.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Correlated Features')

display_html(iris_df._repr_html_()+"     "+uncorrelated_df._repr_html_(), raw=True)

dimesnionality-reduction-high-variance-filter-remove-high-correlated-features

For the complete code, check the feature engineering notebook here.

Conclusion

Feature selection is an important step in machine learning model development. It allows us to select the most relevant features, those with the highest predictive power. The benefits of feature selection include improved model performance (in terms of accuracy, precision, etc.), reduced data dimensionality, which makes analysis and modelling easier, and faster training, since fewer features require fewer compute resources. We have seen several feature selection techniques. In the next post we will look at Feature Encoding for categorical variables. To learn about Dimensionality Reduction check our previous post here.
