Data reduction is the process of reducing the amount of data, either in volume or in the number of features. Reducing volume results in storage efficiency. In machine learning modelling we have to pre-process data with a large number of features to reduce the dimensionality of the dataset and avoid the curse of dimensionality. In this post we will look at various types of data reduction.
Types of Data Reduction
1. Dimensionality Reduction. Data with a large number of features (as in signal processing, bioinformatics and neuroinformatics) is high-dimensional. High-dimensional data is complex to analyse and slow to model. We have to reduce the data from high to low dimension to avoid the curse of dimensionality, which arises because data becomes sparse in high-dimensional space.
Dimensionality Reduction Techniques
- Principal Component Analysis (PCA). Principal component analysis is a linear technique for dimensionality reduction. It maps data from a high-dimensional space to a low-dimensional one while maximizing the variance of the data in the low-dimensional space.
- Non-Negative Matrix Factorization (NMF). NMF is a technique that decomposes a non-negative matrix into the product of two non-negative matrices.
- Linear Discriminant Analysis (LDA). LDA is a generalization of Fisher's linear discriminant, a method used for linear dimensionality reduction.
- Generalized Discriminant Analysis (GDA). GDA is a nonlinear dimensionality reduction technique that uses a kernel function to map high-dimensional data points to a low-dimensional space.
- Autoencoder. An autoencoder is a nonlinear dimensionality reduction technique implemented with neural networks: an encoder compresses the input to a low-dimensional representation and a decoder reconstructs it.
- t-Distributed Stochastic Neighbor Embedding (t-SNE). t-SNE is a nonlinear dimensionality reduction technique commonly used for visualizing high-dimensional data.
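As a minimal sketch of the first technique above, PCA can be computed from the singular value decomposition of centred data. The synthetic dataset and the choice of two components here are illustrative assumptions, not part of the original post:

```python
import numpy as np

# Illustrative PCA sketch on synthetic data (assumed example, not a full library).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))           # 100 samples, 10 features
Xc = X - X.mean(axis=0)                  # centre each feature at zero
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:2].T                # project onto the top 2 principal components
print(X_reduced.shape)                   # 100 samples, now with only 2 features
```

The rows of `Vt` are the principal directions, ordered by the variance they capture, so keeping the first two rows keeps the two highest-variance directions.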
2. Numerosity Reduction. This involves reducing the volume of the data. It can be parametric, as in regression, where a model is fitted and only the learned parameters are stored, or non-parametric, as in clustering or sampling, where no model is assumed.
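A non-parametric numerosity reduction can be as simple as keeping a random sample of the data. The sizes below are assumed for illustration:

```python
import numpy as np

# Hypothetical sketch: non-parametric numerosity reduction by random sampling.
rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)   # original dataset
idx = rng.choice(data.size, size=500, replace=False) # keep a 5% sample
sample = data[idx]
```

Because the sample is drawn uniformly at random, summary statistics such as the mean stay close to those of the full dataset while storage drops twentyfold.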
3. Statistical Modelling. A statistical model can be used to reduce the dataset based on principles such as likelihood and equivariance.
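As a hedged sketch of likelihood-based reduction, a dataset that is well described by a Gaussian can be replaced by its maximum-likelihood parameters. The data below is synthetic and the Gaussian assumption is mine, for illustration only:

```python
import numpy as np

# Sketch: summarize a dataset by its maximum-likelihood Gaussian parameters.
rng = np.random.default_rng(7)
data = rng.normal(loc=10.0, scale=3.0, size=5000)

mu_hat = data.mean()          # MLE of the mean
sigma_hat = data.std(ddof=0)  # MLE of the standard deviation

# The pair (mu_hat, sigma_hat) now stands in for 5000 raw values.
```

The whole dataset is reduced to two numbers; anything that only needs the fitted distribution (simulation, scoring new points) can work from those parameters alone.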
Data reduction is important in machine learning modelling and in the analysis of high-dimensional data. It involves reducing either the volume of the data or the number of features. Feature reduction is important in machine learning modelling, where data with a large number of features (high dimension) is mapped to a low dimension, while reduction in data volume helps in storing the data efficiently. In this post we have looked at data reduction and its different types. In the next post we will look at handling missing values. To learn about data transformation, check our previous post here.