Data Scientists and Machine Learning engineers often end up developing models that suffer from data leakage without noticing it. The model performs well on the validation set but fails once deployed to production. Data leakage is the use of information in the model training process that would not be expected to be available at prediction time, causing the predictive scores (metrics) to overestimate the model's utility when run in a production environment (Wikipedia). Data leakage can occur at any stage of the machine learning process and in different situations, such as training on features that will not be available in production, or poorly sampling the training and validation datasets. Some overfitting can also be the result of data leakage and calls for proper model analysis. In this post we will look at the various causes of data leakage and how to deal with them.
Categories of Leakage
- Feature Leakage. This is also known as target or data leakage and results from having features in the training set that will not be available in production at prediction time. This is a serious problem: the model becomes useless because it has learnt patterns that will not hold at prediction time. Feature leakage can result from the following:
i. The feature is a duplicate of the target
ii. The feature is a proxy for, or a function of, the target
iii. The feature carries information from the future that will not be available at prediction time
Always check whether a feature in your dataset perfectly resembles (or duplicates) the target variable, or is a function of it; such a feature will cause leakage in the model.
- Train-Test Contamination. This involves using training data to validate the model. The most common cause of this type of leakage is performing data pre-processing before splitting the data into training and validation sets. This mistake gives the validation set the same properties as the training set, making the model perform perfectly on the validation set but fail in production.
- Data leakage during imputation. We introduce leakage when we compute an imputation statistic, such as the mean, over the whole dataset before splitting it into training and validation sets. The validation set then shares the training set's properties, so the model looks overly optimistic on validation data while in reality it may be weak in production. The sketch after this list makes this concrete.
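Here is a minimal sketch (assuming scikit-learn and NumPy, with synthetic data; all variable names are illustrative) that contrasts leaky and leak-free mean imputation:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic data with ~10% missing values (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=1000)

# Leaky: the column means are computed over ALL rows, so statistics from
# the future validation rows bleed into the training rows.
X_leaky = SimpleImputer(strategy="mean").fit_transform(X)
X_train_l, X_val_l, y_train_l, y_val_l = train_test_split(
    X_leaky, y, test_size=0.2, random_state=0)

# Leak-free: split first, fit the imputer on the training rows only, and
# reuse the training statistics to transform the validation rows.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)
imputer = SimpleImputer(strategy="mean").fit(X_train)
X_train_clean = imputer.transform(X_train)
X_val_clean = imputer.transform(X_val)
```

With random data the numerical difference is tiny, but on real datasets with skewed or time-dependent features the leaky version quietly inflates validation scores.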
Ways of Avoiding Data Leakage
- Critically analyse any model that seems too good to be true. A perfect score might be the result of leakage.
- Only use data/features that will be available in production. If you suspect that the training set contains temporal data that will not be available once the model is in production, the wise decision is to discard those features.
- Always split your data into training and test sets before pre-processing to avoid train-test contamination. If you pre-process the whole dataset and then split it, the validation set carries information about the training data, so the model fits the validation set perfectly, fooling us into thinking it is good, only to fail in production. Using a machine learning pipeline properly avoids this problem, as the pipeline sketch after this list shows.
- Use a holdout dataset to validate your model before taking it to production. This will tell you whether you are falling into the trap of leakage: if you get "too good to be true" performance, check your model again, as leakage could be causing that over-optimistic result.
- Remove leaky features. Features that are highly correlated with the target, are duplicates of the target, or are functions/proxies of the target are prime candidates for causing leakage and should be removed. The correlation check after this list shows one quick heuristic for spotting them.
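As a concrete illustration of the split-before-pre-processing advice, here is a minimal pipeline sketch (assuming scikit-learn; the synthetic data and model choice are assumptions, not a prescription). Wrapping the imputer and scaler in a Pipeline and passing it to cross_val_score means every pre-processing step is re-fit on each training fold and never sees the corresponding validation fold:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with missing values (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
X[rng.random(X.shape) < 0.1] = np.nan
y = rng.integers(0, 2, size=1000)

model = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # fit on training folds only
    ("scale", StandardScaler()),                 # likewise
    ("clf", LogisticRegression(max_iter=1000)),
])

# cross_val_score re-fits the whole pipeline inside each fold, so no
# validation row ever influences the imputation or scaling statistics.
print(cross_val_score(model, X, y, cv=5).mean())
```

And here is one quick heuristic for flagging candidate leaky features: anything almost perfectly correlated with the target deserves a manual look. The 0.95 threshold and the column names below are assumptions for illustration, not a rule:

```python
import numpy as np
import pandas as pd

# Toy frame where a duplicate of the target has sneaked into the features
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "feature_a": rng.normal(size=500),
    "feature_b": rng.normal(size=500),
})
df["label"] = (df["feature_a"] + rng.normal(scale=2.0, size=500) > 0).astype(int)
df["leaky_copy"] = df["label"]  # the leaky feature

# Absolute correlation of every feature with the target
corr = df.corr()["label"].drop("label").abs()
print(corr[corr > 0.95])  # flags leaky_copy; 0.95 is an assumed threshold
```

A flagged feature is not automatically leaky; domain knowledge has to confirm whether it is a duplicate, proxy, or function of the target before you drop it.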
Conclusion
Data leakage is a serious problem that all Data Scientists and Machine Learning engineers face. It results in models that look overly strong on validation data but are weak in production, and it can occur at any phase of the machine learning life cycle. We have looked at the common types of data leakage in machine learning: feature/target/data leakage, train-test contamination, and imputation leakage. We have also looked at various ways to avoid data leakage, such as checking features that are highly correlated with the target variable and splitting the training and validation sets before data pre-processing, among others. In the next post we will look at data augmentation and its significance in machine learning. To learn about Data Sampling and Resampling, check our previous post here.