Working with high-dimensional data often poses a challenge commonly referred to as the **“curse of dimensionality”**. High-dimensional data is difficult to analyse, visualize and model, hence the need to transform it into a lower-dimensional representation. The process of transforming data from a high-dimensional space to a low-dimensional space while preserving its most important properties is referred to as dimensionality reduction. Techniques are commonly categorized as linear, such as Principal Component Analysis (PCA), or non-linear, such as Kernel PCA or autoencoders. In this post we will look at various dimensionality reduction techniques and how to implement them in Python. Download the data for this post **here**.

**Importance of Dimensionality Reduction**

- Resolves multicollinearity by removing redundant features.
- Data in a low-dimensional space is easier to analyse and visualize.
- Computation is faster on low-dimensional data because there are fewer features.

**Challenges of Dimensionality Reduction**

The main disadvantage of dimensionality reduction is that some information is lost when the data is projected from a high-dimensional to a low-dimensional space.

**Dimensionality Reduction Techniques**

**Import Required Libraries**

    import pandas as pd
    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from IPython.display import display_html
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Configure visual display properties
    sns.set_theme(style="whitegrid")
    sns.set(rc={'figure.figsize': (20, 10)})

**Load Data**

    iris_df = pd.read_csv('iris.csv')
    iris_df.head()

**Standardize the data first**

PCA is sensitive to the scale of the features, so we first standardize each feature to mean 0 and variance 1.

    feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
    features = iris_df[feature_cols]  # features
    y = iris_df['class']  # target
    X = StandardScaler().fit_transform(features)  # scale the data
    X_scaled = pd.DataFrame(X, columns=feature_cols)  # create dataframe of scaled data
    scaled_df = pd.concat([X_scaled, y], axis=1)  # join scaled dataframe with target variable

    # Display the two dataframes side by side (styled copies, so iris_df and scaled_df stay DataFrames)
    iris_styled = iris_df.head().style.set_table_attributes("style='display:inline'").set_caption('Original DataFrame')
    scaled_styled = scaled_df.head().style.set_table_attributes("style='display:inline'").set_caption('Scaled DataFrame')
    display_html(iris_styled._repr_html_() + " " + scaled_styled._repr_html_(), raw=True)

**1. Principal Component Analysis (PCA).** This is a commonly used linear technique for dimensionality reduction, mostly used in data analysis and predictive modelling. It projects data from a high-dimensional space onto a small number of new axes referred to as principal components. PCA looks for uncorrelated factors. The first principal component describes the maximum variance in the data. The second principal component explains the most variation not captured by the first component. The third principal component extracts the most variation not explained by the first and second components, and so on. The *i-th* principal component can be taken as a direction orthogonal to the first i-1 principal components that maximizes the variance of the projected data.
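To make the description above concrete, here is a minimal NumPy-only sketch of PCA through an eigendecomposition of the covariance matrix. The function name `pca_numpy` and the synthetic `X_demo` data are assumptions made for illustration; scikit-learn's `PCA` (used in the rest of this post) relies on an SVD instead, so individual components may differ in sign.

```python
import numpy as np

def pca_numpy(X, n_components):
    """Project X onto its top principal components (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)              # PCA works on centered data
    cov = np.cov(X_centered, rowvar=False)       # feature-by-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]            # reorder: largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]       # orthogonal directions of max variance
    scores = X_centered @ components             # data projected onto those directions
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return scores, components, explained_ratio

# Synthetic 4-feature data driven by 2 latent factors (hypothetical example data)
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 4))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 4))

scores, components, ratio = pca_numpy(X_demo, n_components=2)
print(ratio)
```

The columns of `components` are mutually orthogonal, and the projected `scores` are uncorrelated, which is the property described in the paragraph above.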

**Compute PCA with 2 components**

    pca = PCA(n_components=2)
    pca_fit = pca.fit_transform(X_scaled)  # fit and transform data with 2 components
    pca_df = pd.DataFrame(pca_fit, columns=['PC1', 'PC2'])  # create PCA dataframe
    pca_df = pd.concat([pca_df, y], axis=1)  # join PCA dataframe with target variable
    pca_df.head()

**Visualize PCA**

    plt.rcParams['axes.labelsize'] = 20
    sns.set(font_scale=2)
    sns.scatterplot(x="PC1", y="PC2", sizes=(1, 8), linewidth=0, data=pca_df, hue='class')
    plt.show()

**Explained Variance**

Explained variance tells us how much of the total variation each principal component captures, and therefore how much information we lose when reducing the data from 4 features to 2 components. The percentage of variance explained by the first and second principal components:

(pca.explained_variance_ratio_)*100

**array([72.77045209, 23.03052327])**

Total variance retained after dimensionality reduction:

sum(pca.explained_variance_ratio_)*100

**95.80097536148199**
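One common way to pick the number of components is to keep adding components until the cumulative explained variance passes a target. The sketch below assumes the scikit-learn copy of the Iris data (`load_iris`) rather than the post's `iris.csv`, so the figures may differ slightly from those above. The 95% target and the names `X_demo`, `cumulative` and `n_needed` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
X_demo = StandardScaler().fit_transform(load_iris().data)

pca_full = PCA().fit(X_demo)                          # keep all components
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

target = 0.95                                         # illustrative target
# Index of the first cumulative value that reaches the target, plus one
n_needed = int(np.searchsorted(cumulative, target) + 1)

print(cumulative * 100)
print(n_needed)
```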

**2. Independent Component Analysis (ICA).** This is another useful technique for dimensionality reduction that separates multivariate data into additive components that are statistically independent and non-Gaussian. Unlike PCA, which looks for uncorrelated components, ICA looks for independent ones.

**Compute ICA with 2 Components**

    from sklearn.decomposition import FastICA

    ica = FastICA(n_components=2, random_state=0)
    ica_fit = ica.fit_transform(X_scaled)
    ica_df = pd.DataFrame(ica_fit, columns=['IC1', 'IC2'])
    ica_df = pd.concat([ica_df, y], axis=1)
    ica_df.head()

**Visualize Independent Component Analysis**

    plt.rcParams['axes.labelsize'] = 20
    sns.set(font_scale=2)
    sns.scatterplot(x="IC1", y="IC2", sizes=(1, 8), linewidth=0, data=ica_df, hue='class')
    plt.show()

**3. Random Forest.** The Random Forest algorithm can be used for feature selection. The importance of each feature in a trained forest can be estimated in three common ways: 1. built-in (impurity-based) feature importance, 2. permutation-based importance and 3. SHAP-value-based importance.
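As a complement to the built-in importance, permutation-based importance could be computed roughly as follows. This is a sketch under stated assumptions: it uses scikit-learn's bundled Iris data (`load_iris`) instead of `iris.csv`, and the 70/30 split, `n_repeats=10` and the variable names are illustrative choices rather than settings from this post.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle one feature at a time on held-out data and record the drop in accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)

perm_imp = pd.Series(result.importances_mean, index=X_demo.columns)
perm_imp = perm_imp.sort_values(ascending=False)
print(perm_imp)
```

Measuring the drop on held-out data, rather than on the training data, keeps the scores from rewarding features the model has merely memorized.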

**Get feature importance with Random Forest**

    from sklearn.ensemble import RandomForestClassifier

    rf_clf = RandomForestClassifier(n_estimators=100)
    rf_clf.fit(X_scaled, y)  # train the model
    feature_imp = pd.Series(rf_clf.feature_importances_, index=X_scaled.columns).sort_values(ascending=False)
    feature_imp

**Visualize Feature Importance**

    sns.barplot(y=feature_imp.index, x=feature_imp)
    plt.xlabel('Feature Importance Score')
    plt.ylabel('Features')
    plt.title("Visualizing Important Features")
    plt.show()

**4. Factor Analysis.** This is a technique that groups the features into different categories (commonly referred to as factors) based on their correlations. According to **Wikipedia**, factor analysis is used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.

**Compute Factor Analysis**

    from sklearn.decomposition import FactorAnalysis

    fa = FactorAnalysis(n_components=2, random_state=0)
    fa_transform = fa.fit_transform(X_scaled)
    fa_df = pd.DataFrame(fa_transform, columns=['FA1', 'FA2'])
    fa_df = pd.concat([fa_df, y], axis=1)
    fa_df.head()

**Visualize Factor Analysis output**

    plt.rcParams['axes.labelsize'] = 20
    sns.set(font_scale=2)
    sns.scatterplot(x="FA1", y="FA2", sizes=(1, 8), linewidth=0, data=fa_df, hue='class')
    plt.show()

**Compute Features Covariance**

    fa_cov = pd.DataFrame(fa.get_covariance(), columns=X_scaled.columns, index=X_scaled.columns)
    fa_cov

    plt.rcParams['axes.labelsize'] = 20
    sns.set(font_scale=2)
    plt.rcParams['text.color'] = 'blue'
    plt.rcParams['font.size'] = 20
    sns.heatmap(fa_cov, annot=True, fmt=".4f", linewidths=.5, cmap='summer')
    plt.show()

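**5. Low Variance Filter.** In this technique we remove variables that have low or zero variance. We set a variance threshold, compute the variance of each variable and remove every variable whose variance does not exceed the threshold. Variance depends on the scale of a feature, so the features should be on comparable scales before their variances are compared. Note that standardizing to unit variance would make every variance equal to 1, which is why the filter here is applied to the unscaled features. Low-variance features add little value to model prediction, hence we discard them.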

**Compute Variance with sklearn VarianceThreshold**

    from sklearn.feature_selection import VarianceThreshold

    threshold_value = 0.5
    vt = VarianceThreshold(threshold=threshold_value)
    vt_fit = vt.fit_transform(features)

Get variances for each feature

    print(vt.variances_)  # actual variance values
    print(vt.get_support())  # True: variance above the threshold (kept); False: not above it (removed)

Get the columns whose variance does not exceed the 0.5 threshold

    low_var_ = [column for column in features.columns if column not in features.columns[vt.get_support()]]
    low_var_

**['sepal_width']**

Remove low-variance features (variance threshold of 0.5)

    remove_low_vars_ = features.drop(low_var_, axis=1)  # drop features whose variance does not exceed 0.5

    # Display the two dataframes side by side
    original_styled = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
    low_vars_styled = remove_low_vars_.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Low Variance Features')
    display_html(original_styled._repr_html_() + " " + low_vars_styled._repr_html_(), raw=True)

**Low Variance Filter with Pandas df.var() function**

    variance = features.var()
    variance

Low-variance features (variance of 0.5 or less)

    # Select the names of the features whose variance is at most 0.5
    low_var_features = variance[variance <= 0.50].index.tolist()
    low_var_features

**['sepal_width']**

Drop Low variance features

    remove_low_var_features = features.drop(low_var_features, axis=1)  # drop features with variance of 0.5 or less
    remove_low_var_features.head()

    # Display the two dataframes side by side
    original_styled = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
    low_var_features_styled = remove_low_var_features.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Low Variance Features')
    display_html(original_styled._repr_html_() + " " + low_var_features_styled._repr_html_(), raw=True)

**6. High Correlation Filter.** When two features are highly correlated they carry largely the same information, which leads to multicollinearity, so we can remove one of them. Multicollinearity makes a machine learning model harder to interpret and adds complexity without adding information.

Get the correlation matrix first

    cor_matrix = features.corr()
    cor_matrix

Visualize Correlation Matrix

    plt.rcParams['axes.labelsize'] = 20
    sns.set(font_scale=2)
    plt.rcParams['text.color'] = 'blue'
    plt.rcParams['font.size'] = 20
    sns.heatmap(cor_matrix, annot=True, fmt=".4f", linewidths=.5, cmap='winter')
    plt.show()

**List the unique pairwise correlations**

    cor_matrix.unstack().sort_values()  # every feature pair appears twice: (a, b) and (b, a)

    cor_matrix.unstack().sort_values().drop_duplicates()  # keep each distinct correlation value once

**Remove correlated features based on defined correlation threshold**

    # Create an upper triangle matrix with np.triu (k=1 excludes the diagonal)
    # Use the built-in bool: the np.bool alias was removed in NumPy 1.24
    upper_triangle_martix = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
    upper_triangle_martix

Visualize Upper Triangle Matrix

    plt.rcParams['axes.labelsize'] = 20
    sns.set(font_scale=2)
    plt.rcParams['font.size'] = 20
    sns.heatmap(upper_triangle_martix, annot=True, fmt=".4f", linewidths=.5, cmap='Greens')
    plt.show()

Filter features that have a correlation above 0.90 with another feature

    correlated_features = [column for column in upper_triangle_martix.columns if any(upper_triangle_martix[column] > 0.90)]
    correlated_features

**['petal_width']**

Remove highly correlated features identified above

    remove_correlated_features = features.drop(correlated_features, axis=1)

    # Display the two dataframes side by side
    original_styled = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
    uncorrelated_styled = remove_correlated_features.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Correlated Features')
    display_html(original_styled._repr_html_() + " " + uncorrelated_styled._repr_html_(), raw=True)

**7. t-Distributed Stochastic Neighbor Embedding (t-SNE).** This is a nonlinear technique for dimensionality reduction. It converts pairwise distances between data points (Euclidean by default) into similarity probabilities in both the high-dimensional and the low-dimensional space, then positions the low-dimensional points so that the two sets of probabilities match as closely as possible. This method is ideal for visualizing high-dimensional data. However, caution should be taken in clustering and outlier detection use cases, as it does not preserve global distances.

**Reduce data dimension with t-SNE from 4 to 2 components**

    from sklearn.manifold import TSNE

    tsne = TSNE(n_components=2, init='random', random_state=0)
    tsne_fit = tsne.fit_transform(X_scaled)
    tsne_df = pd.DataFrame(tsne_fit, columns=['Component 1', 'Component 2'])
    tsne_df = pd.concat([tsne_df, y], axis=1)
    tsne_df.head()

Visualize t-SNE

    plt.rcParams['axes.labelsize'] = 20
    sns.set(font_scale=2)
    sns.scatterplot(x="Component 1", y="Component 2", sizes=(1, 8), linewidth=0, data=tsne_df, hue='class')
    plt.show()

**8. Uniform Manifold Approximation and Projection (UMAP).** This is a nonlinear dimensionality reduction technique similar to t-SNE, but it typically runs faster and is able to preserve much of both the local and the global structure of the data.

**9. Backward Feature Elimination.** This technique starts by training a model with all features and evaluating its performance. We then remove one feature at a time, re-train the model and evaluate the performance again, dropping the feature whose removal hurts performance the least. We repeat this process until the remaining set of features gives us the best performance. This approach is commonly used with regression models.
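A minimal sketch of backward elimination, assuming scikit-learn 0.24 or later, where `SequentialFeatureSelector` is available. The logistic regression estimator, the choice of keeping 2 features and the use of the bundled Iris data (`load_iris`) are illustrative assumptions rather than settings from this post.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
iris = load_iris(as_frame=True)
X_demo, y_demo = iris.data, iris.target

estimator = LogisticRegression(max_iter=1000)

# Start from all features and greedily drop the one whose removal
# gives the best cross-validated score, until 2 features remain
backward = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="backward", cv=5
)
backward.fit(X_demo, y_demo)

kept = list(X_demo.columns[backward.get_support()])
print(kept)
```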

**10. Forward Feature Selection.** Here we start by training the model with a single feature. At each step we add the feature that improves performance the most and measure the performance again, repeating until n features have been added. We then select the set of features that gives us optimal performance.
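The greedy loop described above can be written by hand in a few lines. The helper `forward_select` below is a hypothetical function written for illustration, not a library API. It assumes cross-validated accuracy as the performance measure, a logistic regression model and scikit-learn's bundled Iris data.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select(X, y, estimator, max_features, cv=5):
    """Greedily add the feature that gives the best cross-validated score."""
    remaining = list(X.columns)
    selected, history = [], []
    while remaining and len(selected) < max_features:
        # Score every candidate feature together with those already selected
        scores = {
            col: cross_val_score(estimator, X[selected + [col]], y, cv=cv).mean()
            for col in remaining
        }
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        history.append((best, scores[best]))
    return selected, history

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
iris = load_iris(as_frame=True)
selected, history = forward_select(
    iris.data, iris.target, LogisticRegression(max_iter=1000), max_features=2
)

for name, score in history:
    print(name, round(score, 4))
```

The `history` list records the score after each addition, which makes it easy to see where adding further features stops helping.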

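**11. Autoencoders.** An autoencoder is a neural network that learns to encode the data into a lower-dimensional representation and then decode it back to the original input. The encoded representation serves as the reduced data. It's a nonlinear technique for dimensionality reduction.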
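As a rough illustration of the encode and decode idea, the sketch below trains scikit-learn's `MLPRegressor` to reproduce its own input through a 2-unit hidden layer. This is an assumption made to stay within the libraries used in this post; autoencoders are normally built with a deep-learning framework. The bundled Iris data, the `tanh` activation and the variable names are illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv
X_demo = StandardScaler().fit_transform(load_iris().data)

# 4 inputs -> 2 hidden units (the bottleneck) -> 4 outputs
autoencoder = MLPRegressor(
    hidden_layer_sizes=(2,), activation="tanh",
    solver="lbfgs", max_iter=2000, random_state=0,
)
autoencoder.fit(X_demo, X_demo)  # the target is the input itself

# Encoder half: apply the first layer's weights and activation by hand
encoded = np.tanh(X_demo @ autoencoder.coefs_[0] + autoencoder.intercepts_[0])

# Decoder half is the (linear) output layer; predict() runs both halves
reconstructed = autoencoder.predict(X_demo)
mse = np.mean((X_demo - reconstructed) ** 2)

print(encoded[:5])
print(mse)
```

The 2-column `encoded` array plays the same role as the PCA or t-SNE output earlier in the post, and the reconstruction error `mse` indicates how much information the bottleneck loses.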

For complete code check the feature-engineering notebook **here**.

**Conclusion**

Data with many features poses a challenge when analysing, visualizing and modelling it. The solution is to reduce the number of features and keep only the few that carry the maximum amount of information. In this post we have looked at what dimensionality reduction is, its benefits and one major disadvantage. We have also looked at the commonly used techniques for reducing the dimensionality of data and their implementation in Python. There are other techniques that we haven't covered here, as we focused on the most commonly used ones. In the next post we will look at **Feature Selection**, which is an important feature engineering process. To learn about feature engineering and the various methods used, check our previous post **here**.