Working with high-dimensional data often poses a challenge commonly referred to as the “curse of dimensionality”. High-dimensional data is difficult to analyse, visualize and model, hence the need to transform it into a lower-dimensional representation. The process of transforming data from a high-dimensional space to a low-dimensional space while preserving its most important features is referred to as dimensionality reduction. Techniques for dimensionality reduction are commonly categorized as linear, such as Principal Component Analysis (PCA), or non-linear, such as Kernel PCA or autoencoders. In this post we will look at various dimensionality reduction techniques and how to implement them in Python. Download the data for this post here.

## Importance of Dimensionality Reduction

1. Resolves multicollinearity by removing redundant features.
2. Data in a low-dimensional space is easier to analyse and visualize.
3. Low-dimensional data is faster to process because it has fewer features.

## Challenges of Dimensionality Reduction

The main disadvantage of dimensionality reduction is that some information is lost when the data is projected from a high-dimensional space to a low-dimensional one.

## Dimensionality Reduction Techniques

Import Required Libraries

```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from IPython.display import display_html
import matplotlib.pyplot as plt
import seaborn as sns

# Configure visual display properties in one call;
# a separate sns.set(...) call would reset the style chosen here
sns.set_theme(style="whitegrid", rc={'figure.figsize': (20, 10)})
```
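The snippets in this post work on a DataFrame named `iris_df` with four measurement columns and a `class` column. The post's own data file is not reproduced here, so the following is only a stand-in sketch: it builds a DataFrame with the same column names from the Iris dataset bundled with scikit-learn. The use of `load_iris` and the class labels it produces are assumptions, not the original loading code.

```python
import pandas as pd
from sklearn.datasets import load_iris

# Assumption: the bundled Iris data stands in for the post's downloadable file
iris = load_iris()
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']

iris_df = pd.DataFrame(iris.data, columns=columns)
# Map the integer targets (0, 1, 2) to their species names
iris_df['class'] = iris.target_names[iris.target]

print(iris_df.head())
```

If you use the downloaded file instead, loading it with `pd.read_csv` and checking that the column names match should be enough for the rest of the code to run.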


Standardize the data first

PCA is sensitive to the scale of the features, so we first standardize the data to unit scale with mean 0 and variance 1.

```python
features = iris_df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]  # features
y = iris_df['class']  # target

X = StandardScaler().fit_transform(features)  # scale the data
X_scaled = pd.DataFrame(X, columns=features.columns)  # create dataframe of scaled data
scaled_df = pd.concat([X_scaled, y], axis=1)  # join scaled dataframe with target variable

# Display the two dataframes side by side
display_html(iris_df._repr_html_() + "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" + scaled_df._repr_html_(), raw=True)
```

1. Principal Component Analysis (PCA). This is a commonly used linear technique for dimensionality reduction, mostly used in data analysis and predictive modelling. It projects data from a high-dimensional space onto a smaller set of new axes referred to as principal components. PCA looks for uncorrelated components. The first principal component captures the maximum variance in the data. The second principal component captures the most variance not captured by the first, the third captures the most variance not explained by the first two, and so on. The i-th principal component can be taken as a direction orthogonal to the first i-1 principal components that maximizes the variance of the projected data.

Compute PCA with 2 components

```python
pca = PCA(n_components=2)
pca_fit = pca.fit_transform(X_scaled)  # fit and transform data with 2 components
pca_df = pd.DataFrame(pca_fit, columns=['PC1', 'PC2'])  # create PCA dataframe
pca_df = pd.concat([pca_df, y], axis=1)  # join PCA dataframe with target variable
```

Visualize PCA

```python
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale=2)
sns.scatterplot(x="PC1", y="PC2", sizes=(1, 8), linewidth=0, data=pca_df, hue='class')
plt.show()
```

Explained Variance

Explained variance tells us how much of the total variation each principal component captures and, by extension, how much information is lost when reducing the data from 4 features to 2 components. The percentage of variance explained by the first and second principal components:

```python
pca.explained_variance_ratio_ * 100
```

array([72.77045209, 23.03052327])

Total variance retained after dimensionality reduction

```python
sum(pca.explained_variance_ratio_) * 100
```

95.80097536148199

2. Independent Component Analysis (ICA). This is another useful technique for dimensionality reduction that decomposes multivariate data into statistically independent, non-Gaussian components. Unlike PCA, which looks for uncorrelated components, ICA looks for independent components.

Compute ICA with 2 Components

```python
from sklearn.decomposition import FastICA

ica = FastICA(n_components=2, random_state=0)
ica_fit = ica.fit_transform(X_scaled)
ica_df = pd.DataFrame(ica_fit, columns=['IC1', 'IC2'])
ica_df = pd.concat([ica_df, y], axis=1)
```

Visualize Independent Component Analysis

```python
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale=2)
sns.scatterplot(x="IC1", y="IC2", sizes=(1, 8), linewidth=0, data=ica_df, hue='class')
plt.show()
```

3. Random Forest. The Random Forest algorithm can be used for feature selection. Features can be ranked by importance using one of the following techniques: 1. built-in feature importance, 2. permutation-based importance and 3. SHAP-value-based importance.

Get feature importance with Random Forest

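```python
from sklearn.ensemble import RandomForestClassifier

# random_state fixes the seed so the importance scores are reproducible
rf_clf = RandomForestClassifier(n_estimators=100, random_state=0)
rf_clf.fit(X_scaled, y)  # train the model

# Built-in (impurity-based) importances, sorted from most to least important
feature_imp = pd.Series(rf_clf.feature_importances_, index=X_scaled.columns).sort_values(ascending=False)
feature_imp
```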

Visualize Feature Importance

```python
sns.barplot(y=feature_imp.index, x=feature_imp)
plt.xlabel('Feature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features")
plt.show()
```
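Permutation-based importance is one of the options listed for Random Forest, and it can be sketched with scikit-learn's `permutation_importance`. The example below is a self-contained illustration rather than part of the original notebook: loading Iris through `load_iris`, the 70/30 train/test split and the variable names are all assumptions made for the sketch.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Assumption: the bundled Iris data stands in for the post's dataset
X, y = load_iris(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature in turn on held-out data and record the drop in accuracy
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
perm_imp = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm_imp)
```

Because the scores are measured on held-out data, they reflect how much the model's predictions actually depend on each feature, which can differ from the built-in impurity-based ranking.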

4. Factor Analysis. This technique groups the features into a smaller number of categories (commonly referred to as factors) based on their correlations. According to Wikipedia, factor analysis is used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.

Compute Factor Analysis

```python
from sklearn.decomposition import FactorAnalysis

fa = FactorAnalysis(n_components=2, random_state=0)
fa_transform = fa.fit_transform(X_scaled)
fa_df = pd.DataFrame(fa_transform, columns=['FA1', 'FA2'])
fa_df = pd.concat([fa_df, y], axis=1)
```

Visualize Factor Analysis output

```python
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale=2)
sns.scatterplot(x="FA1", y="FA2", sizes=(1, 8), linewidth=0, data=fa_df, hue='class')
plt.show()
```

Compute Feature Covariance

```python
# Covariance of the features as estimated by the fitted factor analysis model
fa_cov = pd.DataFrame(fa.get_covariance(), columns=X_scaled.columns, index=X_scaled.columns)
fa_cov
```

```python
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale=2)
plt.rcParams['text.color'] = 'blue'
plt.rcParams['font.size'] = 20
sns.heatmap(fa_cov, annot=True, fmt=".4f", linewidths=.5, cmap='summer')
plt.show()
```

5. Low Variance Filter. In this technique we remove variables that have low or constant variance. We set a variance threshold, compute the variance of each variable and remove those whose variance falls at or below the threshold. Low-variance features add little value to model prediction, hence we discard them. Note that standardizing to unit variance would make every feature's variance equal to 1, so the filter below is applied to the original, unscaled features.

Compute Variance with sklearn VarianceThreshold

```python
from sklearn.feature_selection import VarianceThreshold

threshold_value = 0.5
vt = VarianceThreshold(threshold=threshold_value)
vt_fit = vt.fit_transform(features)
```

Get variances for each feature

```python
print(vt.variances_)  # actual variance values
print(vt.get_support())  # True: variance above the threshold (kept); False: at or below it (removed)
```

Get columns with variance at or below the 0.5 threshold

```python
low_var_ = [column for column in features.columns if column not in features.columns[vt.get_support()]]
low_var_
```

['sepal_width']

Remove low-variance features using the 0.5 threshold

```python
# Drop features with variance at or below 0.5
remove_low_vars_ = features.drop(low_var_, axis=1)

low_vars_df_ = remove_low_vars_.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Low Variance Features')

# Display the original and the filtered dataframes side by side
display_html(iris_df._repr_html_() + "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" + low_vars_df_._repr_html_(), raw=True)
```

Low Variance Filter with Pandas df.var() function

```python
variance = features.var()
variance
```

Low-variance features with variance at or below 0.5

```python
low_var_features = []

# Collect the names of the columns whose variance is at or below the threshold
for column, value in variance.items():
    if value <= 0.50:
        low_var_features.append(column)

low_var_features
```

['sepal_width']

Drop low-variance features

```python
# Drop features with variance at or below 0.5
remove_low_var_features = features.drop(low_var_features, axis=1)

low_var_features_df = remove_low_var_features.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Low Variance Features')
display_html(iris_df._repr_html_() + "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" + low_var_features_df._repr_html_(), raw=True)
```

6. High Correlation Filter. When two features are highly correlated, they carry largely the same information, so we remove one of them to avoid multicollinearity. Multicollinearity makes a machine learning model harder to interpret and adds complexity without adding information.

Get the correlation matrix first

```python
cor_matrix = features.corr()
cor_matrix
```

Visualize Correlation Matrix

```python
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale=2)
plt.rcParams['text.color'] = 'blue'
plt.rcParams['font.size'] = 20
sns.heatmap(cor_matrix, annot=True, fmt=".4f", linewidths=.5, cmap='winter')
plt.show()
```

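List the pairwise correlations and drop duplicated pairs

```python
# Flatten the matrix into one row per (feature, feature) pair, sorted by correlation
cor_matrix.unstack().sort_values()
```

```python
# Each pair appears twice (A-B and B-A) with the same value; keep one entry per value
cor_matrix.unstack().sort_values().drop_duplicates()
```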

Remove correlated features based on defined correlation threshold

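```python
# Create an upper triangle matrix with np.triu (k=1 excludes the diagonal).
# The built-in bool is used because np.bool was removed in recent NumPy versions.
upper_triangle_martix = cor_matrix.where(np.triu(np.ones(cor_matrix.shape), k=1).astype(bool))
upper_triangle_martix
```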

Visualize Upper Triangle Matrix

```python
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale=2)
plt.rcParams['font.size'] = 20
sns.heatmap(upper_triangle_martix, annot=True, fmt=".4f", linewidths=.5, cmap='Greens')
plt.show()
```

Filter highly correlated features with correlation above 0.90

```python
correlated_features = [column for column in upper_triangle_martix.columns if any(upper_triangle_martix[column] > 0.90)]
correlated_features
```

['petal_width']

Remove highly correlated features identified above

```python
remove_correlated_features = features.drop(correlated_features, axis=1)

# Display the two dataframes side by side
uncorrelated_df = remove_correlated_features.head().style.set_table_attributes("style='display:inline'").set_caption('After Removing Correlated Features')
display_html(iris_df._repr_html_() + "&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;" + uncorrelated_df._repr_html_(), raw=True)
```

7. t-Distributed Stochastic Neighbor Embedding (t-SNE). This is a nonlinear technique for dimensionality reduction. It converts distances between data points (for example Euclidean distances) into similarity probabilities in both the high-dimensional and the low-dimensional space, and positions the low-dimensional points so that the two sets of probabilities match as closely as possible. This method is ideal for visualizing high-dimensional data. However, caution should be taken for clustering and outlier detection use cases, as it does not preserve distances.

Reduce data dimension with t-SNE from 4 to 2 components

```python
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, init='random', random_state=0)
tsne_fit = tsne.fit_transform(X_scaled)
tsne_df = pd.DataFrame(tsne_fit, columns=['Component 1', 'Component 2'])
tsne_df = pd.concat([tsne_df, y], axis=1)
```

Visualize t-SNE

```python
plt.rcParams['axes.labelsize'] = 20
sns.set(font_scale=2)
sns.scatterplot(x="Component 1", y="Component 2", sizes=(1, 8), linewidth=0, data=tsne_df, hue='class')
plt.show()
```

8. Uniform Manifold Approximation and Projection (UMAP). This is a nonlinear dimensionality reduction technique similar to t-SNE, but typically with a shorter runtime and better able to preserve both the local and the global structure of the data.

9. Backward Feature Elimination. This technique involves training a model with all features and evaluating its performance. We then remove one feature at a time, re-train the model and evaluate the performance again, dropping the feature whose removal hurts performance the least. We repeat this process until removing another feature would degrade performance, and keep the subset of features that gives the best performance. This approach is usually used in regression models.

10. Forward Feature Selection. Here we start by training the model with a single feature. We evaluate the performance, then add one feature at a time, measuring the performance at each step. We then select the features that give the best performance.
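The two greedy procedures described above can be sketched with scikit-learn's `SequentialFeatureSelector`, which supports both directions through its `direction` parameter and scores each candidate subset with cross-validation. This is an illustration under stated assumptions, not the original notebook code: the logistic regression model, the fixed target of 2 features, the helper name `select_features` and the bundled Iris data are all choices made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Assumption: the bundled Iris data stands in for the post's dataset
X, y = load_iris(return_X_y=True, as_frame=True)
model = LogisticRegression(max_iter=1000)


def select_features(direction, n_features=2):
    """Hypothetical helper: return the column names kept by greedy selection."""
    selector = SequentialFeatureSelector(
        model, n_features_to_select=n_features, direction=direction, cv=5
    )
    selector.fit(X, y)
    return list(X.columns[selector.get_support()])


# "backward" starts from all features and removes one per step;
# "forward" starts from none and adds one per step
backward_selected = select_features("backward")
forward_selected = select_features("forward")

print("Backward elimination keeps:", backward_selected)
print("Forward selection keeps:", forward_selected)
```

The two directions are not guaranteed to agree, since each makes its choices greedily from a different starting point.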

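11. Autoencoders. An autoencoder is a neural network that learns to encode the data into a compact representation and then decode it back to the original input; the compact representation serves as the reduced-dimension data. It’s a nonlinear technique for dimensionality reduction.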
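As a minimal sketch of the idea, a single-hidden-layer autoencoder can be approximated with scikit-learn's `MLPRegressor` by training the network to reproduce its own input. This is an assumption-laden illustration, not a production setup: a dedicated deep learning library would normally be used, and the 2-unit bottleneck, the `tanh` activation and the bundled Iris data are choices made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled Iris data stands in for the post's dataset
X = StandardScaler().fit_transform(load_iris().data)

# One hidden layer of 2 units acts as the bottleneck (the encoded representation)
autoencoder = MLPRegressor(
    hidden_layer_sizes=(2,),
    activation="tanh",
    solver="lbfgs",
    max_iter=2000,
    random_state=0,
)
autoencoder.fit(X, X)  # the target is the input itself

# Encoder half: apply the first layer's weights, bias and activation by hand
encoded = np.tanh(X @ autoencoder.coefs_[0] + autoencoder.intercepts_[0])

# Full pass (encode then decode) to measure how much information survives
reconstructed = autoencoder.predict(X)
print("Encoded shape:", encoded.shape)
print("Reconstruction MSE:", np.mean((X - reconstructed) ** 2))
```

The `encoded` array plays the same role as the two-component outputs of PCA or t-SNE in the earlier sections and can be plotted in the same way.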

For complete code check the feature-engineering notebook here.

## Conclusion

Data with many features poses a challenge when analysing, visualizing and modelling it. The solution is to reduce the number of features and keep only the few that carry the maximum amount of information. In this post we have looked at what dimensionality reduction is, its benefits and one major disadvantage. We have also looked at the commonly used techniques for reducing the dimensionality of data and their implementation in Python. There are other techniques that we haven’t covered; we have focused on the most widely used. In the next post we will look at Feature Selection, which is an important feature engineering process. To learn about feature engineering and the various methods used, check our previous post here.
