Feature engineering is the process of using domain knowledge and scientific techniques to select and transform the most important features/variables so as to yield an optimal machine learning or statistical model. The main goal of feature engineering is to improve the quality of the model. Feature engineering tasks may include reducing the dimensionality of the data, encoding categorical variables, scaling numerical features, and creating new features, among other techniques. In machine learning development, feature engineering is sometimes considered a step in data pre-processing, and some techniques overlap between the two phases of data preparation. Feature engineering is a vast area, with different techniques depending on the data and the business problem being solved. In this post we introduce feature engineering and its most common techniques.
Feature Engineering Process
The feature engineering process can vary depending on the nature of the data, the quality of the data and the business case. In general, it involves domain experts identifying useful features, and statistical tests being performed on the identified features to determine their predictive power. Below is a general feature engineering process.
- Identifying which features are important. This is typically done in discussion with domain experts.
- Determining which new features can be created from existing ones. This may include extracting the year from a datetime column, extracting titles from text, etc.
- Testing the importance of the features and selecting the most relevant ones. This can be done using statistical tests that determine each feature's significance.
- Transforming the selected features. This ensures the machine learning model can properly ingest the data.
- Re-engineering the features. If the performance of the model is not satisfactory, we can repeat the process to select other features or create new ones.
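The feature-creation step above can be sketched with pandas. The dataset and column names (`signup_date`, `name`) are purely illustrative:

```python
import pandas as pd

# Hypothetical toy dataset; the column names are illustrative only.
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2020-01-15", "2021-06-30", "2019-11-02"]),
    "name": ["Mr. John Smith", "Mrs. Jane Doe", "Dr. Alan Grant"],
})

# New feature: the year extracted from a datetime column.
df["signup_year"] = df["signup_date"].dt.year

# New feature: the title (Mr, Mrs, Dr, ...) extracted from a text column.
df["title"] = df["name"].str.extract(r"^(\w+)\.", expand=False)

print(df[["signup_year", "title"]])
```

Both derived columns can now be tested for predictive power like any other feature.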
Feature Engineering Techniques
The techniques for engineering features for machine learning modelling vary depending on the problem being solved, the data available and the type of algorithm used. Below are the common techniques for feature engineering.
- Dimensionality reduction. Too many features may result in a poor model due to overfitting, as the model ends up with high variance. Hence we may need to reduce the number of features.
- Encoding categorical features. Categorical variables need to be encoded into binary variables for optimum model performance.
- Feature scaling. Numerical variables need to be normalized to a uniform scale. This improves the performance of the model and the quality of its predictions.
- Linear transformation. This may be useful when the data is non-linear and we intend to use a linear model.
- Feature importance. This is a technique for identifying which features have high predictive power and should be prioritized during modelling. Highly correlated, redundant features might also be dropped.
- Data leakage detection. It is important to test which features are suspiciously highly correlated with the target variable, are proxies of the target variable, or are duplicates of it. This helps prevent data leakage.
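The encoding and scaling techniques above can be sketched on a toy pandas DataFrame; the columns (`color`, `income`) and values are made up for illustration:

```python
import pandas as pd

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "income": [30000.0, 52000.0, 75000.0, 43000.0],
})

# One-hot encode the categorical column into binary indicator features.
encoded = pd.get_dummies(df["color"], prefix="color")

# Standardize the numeric column to zero mean and unit variance.
df["income_scaled"] = (df["income"] - df["income"].mean()) / df["income"].std()
```

Note that one-hot encoding turned a single column into three, one per category; this is the feature explosion discussed in the next section.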
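Dimensionality reduction, from the first bullet, is often done with principal component analysis (PCA). A minimal numpy sketch on synthetic data, implementing PCA via the singular value decomposition:

```python
import numpy as np

# Synthetic data: 100 samples with 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Center the data, then compute the SVD; rows of Vt are the
# principal directions, ordered by explained variance.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Project onto the top 2 principal components: 5 features become 2.
X_reduced = X_centered @ Vt[:2].T
```

In practice a library implementation (e.g. scikit-learn's PCA) would be used, but the mechanics are the same.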
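The feature importance and leakage bullets can both be illustrated with a simple correlation check on synthetic data. The feature names and the 0.99 threshold are arbitrary choices for this sketch:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
target = rng.normal(size=n)
features = {
    "leaky": target * 100.0,                 # a proxy of the target
    "useful": target + rng.normal(size=n),   # predictive but noisy
    "noise": rng.normal(size=n),             # unrelated to the target
}

# Flag features whose absolute correlation with the target is
# suspiciously high -- likely duplicates or proxies of the target.
for name, values in features.items():
    r = np.corrcoef(values, target)[0, 1]
    if abs(r) > 0.99:
        print(f"{name}: possible leakage (r={r:.3f})")
```

Here only the `leaky` feature is flagged; the merely predictive one is kept.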
Dealing with Features Explosion
Feature explosion is a problem where the number of features grows significantly, to the extent of impacting the modelling process. It can be caused by feature encoding: techniques such as one-hot encoding can result in many new features in the dataset. Another cause can be linear combinations of features. Below are ways to deal with feature explosion.
- Feature selection. We can focus only on the most important features.
- Regularization. This is a technique of penalizing large coefficient values, with the core objective of balancing the bias and variance of the model.
- Kernel methods. With these techniques, the model effectively depends only on the data points close to the decision boundary.
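As a sketch of the regularization bullet, here is ridge (L2-penalized) regression in closed form on synthetic data; the penalty strength `alpha = 10.0` is an arbitrary choice for illustration:

```python
import numpy as np

# Synthetic regression problem: 50 samples, 10 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = X @ np.arange(10) + rng.normal(size=50)

def ridge(X, y, alpha):
    # Closed-form ridge solution: w = (X^T X + alpha * I)^-1 X^T y
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

w_unreg = ridge(X, y, alpha=0.0)   # ordinary least squares
w_reg = ridge(X, y, alpha=10.0)    # penalized coefficients
```

The penalty shrinks the coefficient vector toward zero, reducing variance at the cost of some bias; weak features end up with coefficients near zero even when many features are present.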
At the heart of every machine learning model is the process of feature engineering. Feature engineering is the use of domain knowledge and statistical techniques to select and transform the variables in a dataset that will most improve the performance and predictions of the predictive model. In this post we have looked at the most common feature engineering techniques, such as feature encoding, feature selection, dimensionality reduction and linear transformation, among others. In this series of feature engineering posts we will learn about each of the most common techniques in detail and implement it in Python. In the next post we will look at Dimensionality Reduction and its usefulness in machine learning. To learn about Data Processing in Big Data with Spark, check our previous post here.