Working with high-dimensional data often poses a challenge commonly referred to as the "curse of dimensionality". High-dimensional data is difficult to analyse, visualize and model, hence the need to transform it from a high-dimensional to a low-dimensional space. The process of transforming data from a high-dimensional space to a low-dimensional space while preserving the most important features of the data is referred to as dimensionality reduction. Dimensionality reduction techniques are categorized as linear, such as Principal Component Analysis (PCA), or non-linear, such as Kernel PCA or Autoencoders. In this post we will look at various dimensionality reduction techniques and how to implement them in Python. Download the data for this post here.
Importance of Dimensionality Reduction
- Resolves multicollinearity by removing redundant features.
- Data in low-dimensional space is easy to analyse and visualize.
- It’s faster to compute on low-dimensional data since there are fewer features.
Challenges of Dimensionality Reduction
The main disadvantage of dimensionality reduction is that some information is lost when the data is projected from the high-dimensional to the low-dimensional space.
Dimensionality Reduction Techniques
Import Required Libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from IPython.display import display_html
import matplotlib.pyplot as plt
import seaborn as sns
# Configure visual display properties
sns.set_theme(style="whitegrid")
sns.set(rc={'figure.figsize':(20,10)})
Load Data
iris_df=pd.read_csv('iris.csv')
iris_df.head()
Standardize data first
PCA requires us to standardize the data first to a unit scale with mean 0 and variance 1.
features=iris_df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']] # features
y=iris_df['class'] #target
X=StandardScaler().fit_transform(features) # scale the data
X_scaled=pd.DataFrame(X, columns=['sepal_length', 'sepal_width', 'petal_length', 'petal_width']) # create dataframe of scaled data
scaled_df=pd.concat([X_scaled,y],axis=1) # join scaled dataframe with target variable
# Display the two dataframes side by side
original_styler = iris_df.head().style.set_table_attributes("style='display:inline'").set_caption('Original DataFrame')
scaled_styler = scaled_df.head().style.set_table_attributes("style='display:inline'").set_caption('Scaled DataFrame')
display_html(original_styler._repr_html_()+" "+scaled_styler._repr_html_(), raw=True)
1. Principal Component Analysis (PCA). This is a commonly used linear technique for dimensionality reduction, mostly used in data analysis and predictive modelling. It projects data from a high-dimensional space onto low-dimensional directions referred to as principal components. PCA looks for uncorrelated components. The first principal component captures the maximum variance in the data. The second principal component explains the variation not captured by the first component. The third principal component extracts the variation not explained by the first and second components, and so on. The i-th principal component can be taken as the direction orthogonal to the first i-1 principal components that maximizes the variance of the projected data.
Compute PCA with 2 components
pca=PCA(n_components=2)
pca_fit=pca.fit_transform(X_scaled) # fit and transform data with 2 Components
pca_df=pd.DataFrame(pca_fit,columns=['PC1','PC2']) # Create PCA dataframe
pca_df=pd.concat([pca_df,y],axis=1) # join pca dataframe with target variable
pca_df.head()
Visualize PCA
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
sns.scatterplot(x="PC1", y="PC2", sizes=(1, 8), linewidth=0,data=pca_df,hue='class')
plt.show()
Explained Variance
Explained variance tells us how much of the total variation each principal component captures and, consequently, how much information was lost when reducing the data from the high-dimensional space (4 features) to the low-dimensional space (2 components). Below is the variance explained by the first and second principal components.
(pca.explained_variance_ratio_)*100
array([72.77045209, 23.03052327])
Total variance after dimensionality reduction
sum(pca.explained_variance_ratio_)*100
95.80097536148199
2. Independent Component Analysis (ICA). This is another useful technique for dimensionality reduction that transforms multivariate data into statistically independent, non-Gaussian components. Unlike PCA, which looks for uncorrelated components, ICA looks for independent components.
Compute ICA with 2 Components
from sklearn.decomposition import FastICA
ica=FastICA(n_components=2,random_state=0)
ica_fit=ica.fit_transform(X_scaled)
ica_df=pd.DataFrame(ica_fit,columns=['IC1','IC2'])
ica_df=pd.concat([ica_df,y],axis=1)
ica_df.head()
Visualize Independent Component Analysis
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
sns.scatterplot(x="IC1", y="IC2", sizes=(1, 8), linewidth=0,data=ica_df,hue='class')
plt.show()
3. Random Forest. The Random Forest algorithm can be used for feature selection. It exposes a feature importance property, and importance can be ranked using the following techniques: 1. built-in (impurity-based) feature importance, 2. permutation-based importance, and 3. SHAP-value-based importance. The code below uses the built-in importance; a permutation-based sketch follows the visualization.
Get feature importance with Random Forest
from sklearn.ensemble import RandomForestClassifier
rf_clf=RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_scaled,y) # train the model
feature_imp = pd.Series(rf_clf.feature_importances_,index=X_scaled.columns).sort_values(ascending=False)
feature_imp
Visualize Feature Importance
sns.barplot(y=feature_imp.index, x=feature_imp)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
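Permutation-based Feature Importance
As a complementary check to the built-in (impurity-based) importance above, permutation importance measures how much the model's score drops when a feature's values are randomly shuffled. Below is a minimal sketch using scikit-learn's permutation_importance on the random forest trained above; n_repeats=10 and scoring on the training data are illustrative choices.
from sklearn.inspection import permutation_importance
# Shuffle each feature in turn and measure the drop in the model's score
perm_result = permutation_importance(rf_clf, X_scaled, y, n_repeats=10, random_state=0)
perm_imp = pd.Series(perm_result.importances_mean, index=X_scaled.columns).sort_values(ascending=False)
perm_imp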
4. Factor Analysis. This is a technique that groups features into different categories (commonly referred to as factors) based on their correlations. According to Wikipedia, factor analysis describes variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.
Compute Factor Analysis
from sklearn.decomposition import FactorAnalysis
fa=FactorAnalysis(n_components=2,random_state=0)
fa_transform=fa.fit_transform(X_scaled)
fa_df=pd.DataFrame(fa_transform,columns=['FA1','FA2'])
fa_df=pd.concat([fa_df,y],axis=1)
fa_df.head()
Visualize Factor Analysis output
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
sns.scatterplot(x="FA1", y="FA2", sizes=(1, 8), linewidth=0,data=fa_df,hue='class')
plt.show()
Compute Features Covariance
fa_cov=pd.DataFrame(fa.get_covariance(),columns=X_scaled.columns,index=X_scaled.columns)
fa_cov
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
plt.rcParams['text.color'] = 'blue'
plt.rcParams['font.size'] = 20
sns.heatmap(fa_cov, annot=True, fmt=".4f", linewidths=.5,cmap='summer')
plt.show()
5. Low Variance Filter. In this technique we remove variables that have low or near-constant variance. We first normalize the data and set a variance threshold. We then compute the variance of each variable and remove the variables whose variance falls below the threshold. Low-variance features add little value to model prediction, hence we discard them.
Compute Variance with sklearn VarianceThreshold
from sklearn.feature_selection import VarianceThreshold
threshold_value=0.5
vt=VarianceThreshold(threshold=threshold_value)
vt_fit=vt.fit_transform(features)
Get variances for each feature
print(vt.variances_) # actual variance values
print(vt.get_support()) # True: variance above the threshold, False: variance below the threshold
Get columns with variance below the 0.5 threshold
low_var_=[column for column in features.columns if column not in features.columns[vt.get_support()]]
low_var_
['sepal_width']
Remove Low Variance Features below the 0.5 threshold
remove_low_vars_=features.drop(low_var_,axis=1) # drop features with variance below the threshold
# Display the original and filtered dataframes side by side
original_styler = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
low_vars_styler = remove_low_vars_.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Low Variance Features')
display_html(original_styler._repr_html_()+" "+low_vars_styler._repr_html_(), raw=True)

Low Variance Filter with Pandas df.var() function
variance=features.var()
variance
Low variance features with variance below the 0.5 threshold
low_var_features = []
for i in range(len(variance)):
    if variance.iloc[i] <= 0.50:
        low_var_features.append(features.columns[i])
low_var_features
['sepal_width']
Drop Low variance features
remove_low_var_features=features.drop(low_var_features,axis=1)
remove_low_var_features.head()
# Display the original and filtered dataframes side by side
original_styler = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
low_var_styler = remove_low_var_features.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Low Variance Features')
display_html(original_styler._repr_html_()+" "+low_var_styler._repr_html_(), raw=True)
6. High Correlation Filter. When two features/variables are highly correlated, we should remove one of them since keeping both leads to multicollinearity. Multicollinearity affects the interpretation of a machine learning model and makes the model more complex due to the redundant, highly correlated features.
Get the correlation matrix first
cor_matrix=features.corr()
cor_matrix
Visualize Correlation Matrix
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
plt.rcParams['text.color'] = 'blue'
plt.rcParams['font.size'] = 20
sns.heatmap(cor_matrix, annot=True, fmt=".4f", linewidths=.5,cmap='winter')
plt.show()
Remove correlated features that are duplicates
cor_matrix.unstack().sort_values()
cor_matrix.unstack().sort_values().drop_duplicates() # drop duplicate feature pairs
Remove correlated features based on defined correlation threshold
# Create an upper triangle matrix with np.triu
upper_triangle_matrix = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
upper_triangle_matrix
Visualize Upper Triangle Matrix
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
plt.rcParams['font.size'] = 20
sns.heatmap(upper_triangle_matrix, annot=True, fmt=".4f", linewidths=.5,cmap='Greens')
plt.show()
Filter highly correlated features with > 90% threshold
correlated_features=[column for column in upper_triangle_matrix.columns if any(upper_triangle_matrix[column] > 0.90)]
correlated_features
['petal_width']
Remove highly correlated features identified above
remove_correlated_features=features.drop(correlated_features,axis=1)
# Display the two dataframes side by side
original_styler = features.head().style.set_table_attributes("style='display:inline'").set_caption('Original Data')
uncorrelated_styler = remove_correlated_features.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Correlated Features')
display_html(original_styler._repr_html_()+" "+uncorrelated_styler._repr_html_(), raw=True)
7. t-Distributed Stochastic Neighbor Embedding (t-SNE). This is a nonlinear technique for dimensionality reduction. It works by converting pairwise distances (e.g. Euclidean) between data points into probabilities that measure similarity, and then finding a low-dimensional embedding whose similarities match those of the high-dimensional space. This method is ideal for visualizing high-dimensional data. However, caution should be taken for clustering and outlier detection use-cases as t-SNE does not preserve distances.
Reduce data dimension with t-SNE from 4 to 2 components
from sklearn.manifold import TSNE
tsne=TSNE(n_components=2,init='random',random_state=0)
tsne_fit=tsne.fit_transform(X_scaled)
tsne_df=pd.DataFrame(tsne_fit,columns=['Component 1','Component 2'])
tsne_df=pd.concat([tsne_df,y],axis=1)
tsne_df.head()
Visualize t-SNE
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale = 2)
sns.scatterplot(x="Component 1", y="Component 2", sizes=(1, 8), linewidth=0,data=tsne_df,hue='class')
plt.show()
8. Uniform Manifold Approximation and Projection (UMAP). This is a nonlinear dimensionality reduction technique similar to t-SNE, but with improved runtime and better preservation of both the local and global structure of the data.
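Reduce data dimension with UMAP
Below is a minimal UMAP sketch on the scaled iris data. It assumes the umap-learn package is installed (UMAP is not part of scikit-learn), and the choice of 2 components simply mirrors the earlier examples.
# pip install umap-learn  (assumed installed)
import umap
umap_reducer = umap.UMAP(n_components=2, random_state=0)
umap_fit = umap_reducer.fit_transform(X_scaled) # embed the scaled features in 2 dimensions
umap_df = pd.DataFrame(umap_fit, columns=['UMAP1', 'UMAP2'])
umap_df = pd.concat([umap_df, y], axis=1)
umap_df.head()
The resulting umap_df can be visualized with the same scatterplot pattern used for t-SNE above.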
9. Backward Feature Elimination. This technique involves training a model with all the features and evaluating its performance. We then remove one feature at a time, re-train the model and evaluate the performance again. This process is repeated over feature combinations until we are left with the subset of features that gives maximum performance. This approach is usually used with regression models. A sketch of both the backward and forward approaches follows item 10.
10. Forward Feature Selection. Here we start by training the model with a single feature. We evaluate the performance, add another feature and re-evaluate, repeating this step by step, and finally select the subset of features that gives optimal performance.
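Backward and Forward selection with SequentialFeatureSelector
Below is a minimal sketch of both approaches using scikit-learn's SequentialFeatureSelector. The logistic regression base estimator and the target of 2 selected features are illustrative assumptions; any estimator and feature count can be used.
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression(max_iter=1000) # illustrative base estimator
# Backward elimination: start with all features and drop one at a time
backward_selector = SequentialFeatureSelector(estimator, n_features_to_select=2, direction='backward')
backward_selector.fit(X_scaled, y)
print(list(X_scaled.columns[backward_selector.get_support()]))
# Forward selection: start with no features and add one at a time
forward_selector = SequentialFeatureSelector(estimator, n_features_to_select=2, direction='forward')
forward_selector.fit(X_scaled, y)
print(list(X_scaled.columns[forward_selector.get_support()]))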
11. Autoencoders. An autoencoder is a neural network that encodes the data into a lower-dimensional representation and then decodes it to reconstruct the input. It's a nonlinear technique for dimensionality reduction; the output of the encoder layer serves as the reduced representation.
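Autoencoder sketch with Keras
Below is a minimal autoencoder sketch with a 2-unit bottleneck. It assumes TensorFlow/Keras is installed (it is not used elsewhere in this post), and the layer sizes, activations and training hyperparameters are illustrative choices rather than tuned values.
from tensorflow import keras           # assumes TensorFlow is installed
from tensorflow.keras import layers
n_features = X_scaled.shape[1]         # 4 input features
encoding_dim = 2                       # size of the low-dimensional representation
inputs = keras.Input(shape=(n_features,))
encoded = layers.Dense(encoding_dim, activation='relu')(inputs)    # encoder
decoded = layers.Dense(n_features, activation='linear')(encoded)   # decoder
autoencoder = keras.Model(inputs, decoded)   # learns to reconstruct the input
encoder = keras.Model(inputs, encoded)       # outputs the reduced representation
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X_scaled, X_scaled, epochs=50, batch_size=16, verbose=0)
ae_df = pd.DataFrame(encoder.predict(X_scaled), columns=['AE1', 'AE2'])
ae_df = pd.concat([ae_df, y], axis=1)
ae_df.head()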
For complete code check the feature-engineering notebook here.
Conclusion
Data with many features poses a challenge when analysing, visualizing and modelling it. The solution is to reduce the number of features and keep only the few features that carry the maximum amount of information. In this post we have looked at what dimensionality reduction is, its benefits and its main disadvantage. We have also looked at the most commonly used techniques for reducing the dimensionality of data and their implementation in Python. There are other techniques that we haven't covered, but we have focused on the most widely used ones. In the next post we will look at Feature Selection, which is an important feature engineering process. To learn about feature engineering and the various methods used, check our previous post here.