Working with high-dimensional data often poses a challenge commonly referred to as the “curse of dimensionality”. High-dimensional data is difficult to analyse, visualize and model, hence the need to transform it from a high-dimensional space to a low-dimensional one. The process of transforming data from a high-dimensional space to a low-dimensional space while preserving the most important features of the data is referred to as dimensionality reduction. Dimensionality reduction techniques can be categorized as linear, such as Principal Component Analysis (PCA), or non-linear, such as Kernel PCA or Autoencoders. In this post we will look at various dimensionality reduction techniques and how to implement them in Python. Download the data for this post here.

dimensionality-reduction-image

Importance of Dimensionality Reduction

  1. Resolves multicollinearity by removing redundant features.
  2. Data in low-dimensional space is easy to analyse and visualize.
  3. Computation on low-dimensional data is faster because there are fewer features.

Challenges of Dimensionality Reduction

The main disadvantage of dimensionality reduction is the loss of some information as the data is projected from a high-dimensional space to a low-dimensional one.

Dimensionality Reduction Techniques

Import Required Libraries

                    

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from IPython.display import display_html 
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visual display properties
sns.set_theme(style="whitegrid")
sns.set(rc={'figure.figsize':(20,10)})

Load Data

                    

iris_df=pd.read_csv('iris.csv')
iris_df.head()

seaborn-scatter-plot-load-iris-data

Standardize the data first
PCA requires us to standardize the data so that each feature is on a unit scale, with mean 0 and variance 1.

                    

features=iris_df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']] # features
y=iris_df['class'] #target

X=StandardScaler().fit_transform(features) # scale the data
X_scaled=pd.DataFrame(X, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width']) # create dataframe of scaled data
scaled_df=pd.concat([X_scaled,y],axis=1) # join scaled dataframe with target variable

# Display the two dataframes side by side
iris_df = iris_df.head().style.set_table_attributes("style='display:inline'").set_caption('Original DataFrame')
scaled_df = scaled_df.head().style.set_table_attributes("style='display:inline'").set_caption('Scaled DataFrame')

display_html(iris_df._repr_html_()+"     "+scaled_df._repr_html_(), raw=True)

dimesnionality-reduction-standardize-data

1. Principal Component Analysis (PCA). This is a commonly used linear technique for dimensionality reduction, widely applied in data analysis and predictive modelling. It projects data from a high-dimensional space onto a smaller set of new dimensions referred to as principal components. PCA looks for uncorrelated components: the first principal component captures the maximum variance in the data, the second principal component explains the variation not captured by the first, the third principal component extracts the variation not explained by the first and second, and so on. The i-th principal component can be taken as the direction orthogonal to the first i-1 principal components that maximizes the variance of the projected data.

Compute PCA with 2 components

                    

pca=PCA(n_components=2)
pca_fit=pca.fit_transform(X_scaled) # fit and transform data with 2 Components
pca_df=pd.DataFrame(pca_fit,columns=['PC1','PC2']) # Create PCA dataframe
pca_df=pd.concat([pca_df,y],axis=1) # join pca dataframe with target variable
pca_df.head()

dimesnionality-reduction-pca

Visualize PCA

                    

plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
sns.scatterplot(x="PC1", y="PC2", sizes=(1, 8), linewidth=0,data=pca_df,hue='class')
plt.show()

dimesnionality-reduction-visualize-pca

Explained Variance
Explained variance tells us how much of the data’s variation each principal component captures, and therefore how much information was lost when reducing the data from the high-dimensional space (4 features) to the low-dimensional space (2 components). Below is the variance explained by the first and second principal components.

                    

(pca.explained_variance_ratio_)*100

array([72.77045209, 23.03052327])

Total variance after dimensionality reduction

                    

sum(pca.explained_variance_ratio_)*100

95.80097536148199

2. Independent Component Analysis (ICA). This is another useful technique for dimensionality reduction that transforms multivariate data into statistically independent, non-Gaussian components. Unlike PCA, which looks for uncorrelated components, ICA looks for independent ones.

Compute ICA with 2 Components

                    

from sklearn.decomposition import FastICA

ica=FastICA(n_components=2,random_state=0)
ica_fit=ica.fit_transform(X_scaled)
ica_df=pd.DataFrame(ica_fit,columns=['IC1','IC2'])
ica_df=pd.concat([ica_df,y],axis=1)
ica_df.head()

dimesnionality-reduction-ica

Visualize Independent Component Analysis

                    

plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
sns.scatterplot(x="IC1", y="IC2", sizes=(1, 8), linewidth=0,data=ica_df,hue='class')
plt.show()

dimesnionality-reduction-visualize-ica

3. Random Forest. The Random Forest algorithm can be used for feature selection. Its feature importances can be computed in several ways: 1. built-in (impurity-based) importance, 2. permutation-based importance and 3. SHAP-value-based importance. Below we use the built-in importance; a permutation-based sketch is shown after the plot.

Get feature importance with Random Forest

                    

from sklearn.ensemble import RandomForestClassifier

rf_clf=RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_scaled,y) # train the model

feature_imp = pd.Series(rf_clf.feature_importances_,index=X_scaled.columns).sort_values(ascending=False)
feature_imp

dimesnionality-reduction-random-forest-feature-importance

Visualize Feature Importance

                    

sns.barplot(y=feature_imp.index, x=feature_imp)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()

dimesnionality-reduction-visualize-random-forest-feature-importance
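
Permutation-based Feature Importance

As a sketch of the permutation-based approach mentioned above, we can use scikit-learn's permutation_importance on the random forest we just trained; the number of repeats and the random state below are illustrative choices, not values from the original post.

from sklearn.inspection import permutation_importance

# Shuffle each feature in turn and measure how much the model's accuracy drops
perm_result = permutation_importance(rf_clf, X_scaled, y, n_repeats=10, random_state=0)
perm_imp = pd.Series(perm_result.importances_mean, index=X_scaled.columns).sort_values(ascending=False)
perm_imp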

4. Factor Analysis. This technique groups features into categories (commonly referred to as factors) based on their correlations. As Wikipedia puts it, factor analysis describes variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.

Compute Factor Analysis

                    

from sklearn.decomposition import FactorAnalysis

fa=FactorAnalysis(n_components=2,random_state=0)
fa_transform=fa.fit_transform(X_scaled)
fa_df=pd.DataFrame(fa_transform,columns=['FA1','FA2'])
fa_df=pd.concat([fa_df,y],axis=1)
fa_df.head()

dimesnionality-reduction-factor-analysis

Visualize Factor Analysis output

                    

plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
sns.scatterplot(x="FA1", y="FA2", sizes=(1, 8), linewidth=0,data=fa_df,hue='class')
plt.show()

dimesnionality-reduction-visualize-factor-analysis

Compute Features Covariance

                    

fa_cov=pd.DataFrame(fa.get_covariance(),columns=X_scaled.columns,index=X_scaled.columns)
fa_cov

dimesnionality-reduction-factor-analysis-get-variance

                    

plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
plt.rcParams['text.color'] = 'blue'
plt.rcParams['font.size'] = 20
sns.heatmap(fa_cov, annot=True, fmt=".4f", linewidths=.5,cmap='summer')
plt.show()

dimesnionality-reduction-factor-analysis-visualize-covariance

5. Low Variance Filter. In this technique we remove variables that have low or constant variance. We first normalize the data and set a variance threshold, then compute the variance of each variable and drop the variables whose variance falls below the threshold. Low-variance features add little value to model prediction, hence we discard them.

Compute Variance with sklearn VarianceThreshold

                    

from sklearn.feature_selection import VarianceThreshold

threshold_value=0.5
vt=VarianceThreshold(threshold=threshold_value)   
vt_fit=vt.fit_transform(features)

Get variances for each feature

                    

print(vt.variances_) # actual variance values
print(vt.get_support()) # True: variance above the threshold, False: variance below the threshold

dimesnionality-reduction-low-variance-filter-feature-variances

Get columns with variance below the 0.5 threshold

                    

low_var_=[column for column in features.columns if column not in features.columns[vt.get_support()]]
low_var_

['sepal_width']

Remove features with variance below the 0.5 threshold

                    
remove_low_vars_=features.drop(low_var_,axis=1) # drop features with variance below the threshold

# Display the two dataframes side by side
iris_df = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
low_vars_df_ = remove_low_vars_.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Low Variance Features')

display_html(iris_df._repr_html_()+"     "+low_vars_df_._repr_html_(), raw=True)


dimesnionality-reduction-low-variance-filter-remove-low-variance-features

Low Variance Filter with Pandas df.var() function
                    

variance=features.var()
variance

dimesnionality-reduction-low-variance-filter-with-df-var-function

Low variance features with variance below the 0.5 threshold

                    

low_var_features = []

for i in range(len(variance)):
    if variance.iloc[i] <= 0.50:  # positional access with .iloc avoids deprecated integer indexing on a labelled Series
        low_var_features.append(features.columns[i])

low_var_features

['sepal_width']

Drop Low variance features

                    

remove_low_var_features=features.drop(low_var_features,axis=1) # drop features with variance below the threshold
remove_low_var_features.head()

# Display the two dataframes side by side
iris_df = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
low_var_features_df = remove_low_var_features.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Low Variance Features')

display_html(iris_df._repr_html_()+"     "+low_var_features_df._repr_html_(), raw=True)

dimesnionality-reduction-low-variance-filter-remove-low-variance-features

6. High Correlation Filter. When two features are highly correlated we remove one of them, since keeping both leads to multicollinearity. Multicollinearity affects the interpretation of a machine learning model and makes the model more complex without adding information.

Get the correlation matrix first

                    

cor_matrix=features.corr()
cor_matrix

dimesnionality-reduction-high-variance-filter-correlation-matrix

Visualize Correlation Matrix

                    

plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
plt.rcParams['text.color'] = 'blue'
plt.rcParams['font.size'] = 20
sns.heatmap(cor_matrix, annot=True, fmt=".4f", linewidths=.5,cmap='winter')
plt.show()

dimesnionality-reduction-high-variance-visualize-correlation-matrix

Inspect pairwise correlations and drop duplicate pairs

                    

cor_matrix.unstack().sort_values()

dimesnionality-reduction-high-variance-filter-unstack-correlation-matrix

                    

cor_matrix.unstack().sort_values().drop_duplicates() # drop duplicate pairs (the matrix is symmetric)

dimesnionality-reduction-high-variance-filter-unstack-correlation-matrix-remove-duplicates

Remove correlated features based on defined correlation threshold

                    

# Create an upper triangle matrix with np.triu
upper_triangle_matrix = cor_matrix.where(np.triu(np.ones(cor_matrix.shape),k=1).astype(bool))
upper_triangle_matrix

dimesnionality-reduction-high-variance-filter-correlation-threshold

Visualize Upper Triangle Matrix

                    

plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
plt.rcParams['font.size'] = 20
sns.heatmap(upper_triangle_matrix, annot=True, fmt=".4f", linewidths=.5,cmap='Greens')
plt.show()

dimesnionality-reduction-high-variance-filter-visulaize-upper-triangle-matrix

Filter highly correlated features with > 90% threshold

                    

correlated_features=[column for column in upper_triangle_matrix.columns if any(upper_triangle_matrix[column] > 0.90)]
correlated_features

['petal_width']

Remove highly correlated features identified above

                    

remove_correlated_features=features.drop(correlated_features,axis=1)

# Display the two dataframes side by side
iris_df = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
uncorrelated_df = remove_correlated_features.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Correlated Features')

display_html(iris_df._repr_html_()+"     "+uncorrelated_df._repr_html_(), raw=True)

dimesnionality-reduction-high-variance-filter-remove-high-correlated-features

7. t-Distributed Stochastic Neighbor Embedding (t-SNE). This is a nonlinear technique for dimensionality reduction. It works by converting pairwise distances between data points (for example, Euclidean distances) in the high-dimensional space into probabilities that express similarity, and then finding a low-dimensional embedding whose similarities match them. This method is ideal for visualizing high-dimensional data. However, caution should be taken for clustering and outlier detection use-cases as it does not preserve distances or densities.

Reduce data dimension with t-SNE from 4 to 2 components

                    

from sklearn.manifold import TSNE

tsne=TSNE(n_components=2,init='random',random_state=0)
tsne_fit=tsne.fit_transform(X_scaled)
tsne_df=pd.DataFrame(tsne_fit,columns=['Component 1','Component 2'])
tsne_df=pd.concat([tsne_df,y],axis=1)
tsne_df.head()

dimesnionality-reduction-t-SNE-data

Visualize t-SNE

                    

plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
sns.scatterplot(x="Component 1", y="Component 2", sizes=(1, 8), linewidth=0,data=tsne_df,hue='class')
plt.show()

dimesnionality-reduction-t-SNE-visualize-t-SNE

8. Uniform Manifold Approximation and Projection (UMAP). This is a nonlinear dimensionality reduction technique similar to t-SNE, but with a faster runtime and better preservation of both the local and global structure of the data. A sketch of applying it to the scaled iris data follows below.
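
Reduce data to 2 components with UMAP

A minimal sketch, assuming the third-party umap-learn package is installed (pip install umap-learn); the parameter values are illustrative defaults rather than tuned settings.

import umap  # provided by the umap-learn package

umap_reducer = umap.UMAP(n_components=2, random_state=0)  # reduce the 4 scaled features to 2 components
umap_fit = umap_reducer.fit_transform(X_scaled)

umap_df = pd.DataFrame(umap_fit, columns=['UMAP1', 'UMAP2'])
umap_df = pd.concat([umap_df, y], axis=1)  # join with the target variable
umap_df.head()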

9. Backward Feature Elimination. In this technique we start by training a model on all the features and evaluating its performance. We then drop one feature at a time, re-train the model and evaluate it again, repeating the process until we are left with the subset of features that gives the best performance. This approach is commonly used with regression models. A sketch using scikit-learn follows below.
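
Backward elimination with SequentialFeatureSelector

A minimal sketch using scikit-learn's SequentialFeatureSelector (available from scikit-learn 0.24); the logistic regression estimator and the choice of keeping 2 features are illustrative assumptions.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Start from all 4 features and greedily drop them until 2 remain
backward_selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction='backward'
)
backward_selector.fit(X_scaled, y)
X_scaled.columns[backward_selector.get_support()]  # names of the retained features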

10. Forward Feature Selection. This is the reverse of backward elimination: we start by training the model with a single feature, evaluate its performance, then add one feature at a time, measuring performance at each step, and keep the subset of features that gives optimal performance. A sketch follows below.
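
Forward selection with SequentialFeatureSelector

The same SequentialFeatureSelector can run in the forward direction; as before, the estimator and the target of 2 features are illustrative assumptions.

from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Start from an empty set and greedily add features until 2 are selected
forward_selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=2,
    direction='forward'
)
forward_selector.fit(X_scaled, y)
X_scaled.columns[forward_selector.get_support()]  # names of the selected features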

11. Autoencoders. An autoencoder is a neural network trained to reconstruct its own input: an encoder compresses the data into a low-dimensional representation and a decoder reconstructs the input from it. The compressed representation serves as a nonlinear low-dimensional version of the data, which makes autoencoders a nonlinear technique for dimensionality reduction. A sketch follows below.
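
Reduce data to 2 components with an Autoencoder

A minimal sketch of an autoencoder with a 2-unit bottleneck, assuming TensorFlow/Keras is available; the layer sizes, activations and training settings are illustrative assumptions rather than tuned values.

from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

n_features = X_scaled.shape[1]  # 4 input features

# Encoder compresses the 4 features to 2; decoder reconstructs the 4 features
inputs = Input(shape=(n_features,))
encoded = Dense(2, activation='linear', name='bottleneck')(inputs)
decoded = Dense(n_features, activation='linear')(encoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=16, verbose=0)

# Use the trained encoder as the dimensionality reducer
encoder = Model(inputs, encoded)
ae_df = pd.DataFrame(encoder.predict(X_scaled), columns=['AE1', 'AE2'])
ae_df = pd.concat([ae_df, y], axis=1)
ae_df.head()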

For complete code check the feature-engineering notebook here.

Conclusion

Data with many features poses a challenge when analysing, visualizing and modelling it. The solution is to reduce the number of features and keep only the few that carry the maximum amount of information. In this post we have looked at what dimensionality reduction is, its benefits and its main disadvantage. We have also looked at commonly used techniques for reducing the dimensionality of data and their implementation in Python. There are other techniques we have not covered here; we have focused on the most commonly used ones. In the next post we will look at Feature Selection, which is an important feature engineering process. To learn about feature engineering and the various methods used, check our previous post here.
