Always Creating

Data Leakage: An Elusive ML Mistake

June 01, 2020



Data leakage happens when your model is exposed to information about the target variable that won't be available in a real-world scenario. The model learns to rely on that signal during training, so it fails to generalize to real-world data.

You can fix issues such as underfitting or overfitting because they produce external signals [e.g., unusually low validation/test accuracy]. With data leakage, there are no obvious signals. Data leakage is elusive because it gives a false perception that the model works fine when the reality is altogether different. You realize the issue only when the model is in production and isn't performing well on real-world data. That's too late!

Major Mistakes:

  • Not keeping a separate test set — the validation set is not enough to evaluate model performance. Since you tune your model on the validation data, the training config and model hyperparameters end up tuned to perform well on that dataset. You need to explicitly verify performance on a held-out test set before deploying the model.

  • Normalizing the whole dataset before splitting it, or applying feature-encoding techniques such as target encoding to the whole dataset. In both cases, statistics computed from validation and test rows leak into the training features.

  • Using features that won't be available in the real-world dataset at prediction time. This generally happens when an input feature is derived from, or determined after, the target variable. A trivial example: you are predicting the selling price of a car, but using a feature such as the car's insurance cost, which generally comes into the picture only after buying the car [and is thus highly correlated with the selling price].
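The second mistake above is easy to demonstrate with a small sketch (plain NumPy on a hypothetical random dataset; the numbers are illustrative, not from the post): normalization statistics computed on the full dataset differ from statistics computed on the training rows alone, so the "leaky" version lets test-set information shape every preprocessed sample.

```python
import numpy as np

# Hypothetical dataset: 200 samples, 3 features (illustration only)
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

# Leaky: normalization statistics computed over ALL rows,
# including the rows that will later become the test set
leaky_mean = data.mean(axis=0)

# Correct: split first, then compute statistics on train rows only
train, test = data[:160], data[160:]
train_mean = train.mean(axis=0)

# The two sets of statistics differ, so the leaky version injects
# test-set information into the preprocessing of every sample.
print(np.allclose(leaky_mean, train_mean))
```

The gap between the two means shrinks as the dataset grows, but it never reaches zero — and for small or skewed datasets it can meaningfully inflate validation scores.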

How to mitigate it:

  • Split the dataset into train, validation & test sets before any feature engineering.
  • Carefully think of an effective way to split the dataset so that it reflects the real-world scenario (e.g., use temporal splits for time-series data; for imbalanced data, stratify with respect to the target variable).
  • Explicitly check for data redundancy. [You don't want the same data points in both the train and validation sets.]
  • Do feature engineering on validation and test data using statistics from the train data only. E.g., when normalizing, use the mean & standard deviation of the train data for each of the datasets:
# Statistics come from the training set only
mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std
# Reuse the same training-set statistics on the test set
test_data -= mean
test_data /= std
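The first two mitigation points can be sketched as a minimal splitting helper (plain NumPy; the function name and split fractions are illustrative assumptions, not from the post): shuffle the row indices once with a fixed seed, then carve out train, validation, and test slices before any feature engineering touches the data.

```python
import numpy as np

def split_dataset(data, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle row indices once, then slice into train/val/test.

    Splitting happens BEFORE any normalization or encoding, so every
    statistic later computed on `train` is leakage-free.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    n_train = int(train_frac * len(data))
    n_val = int(val_frac * len(data))
    return (data[idx[:n_train]],
            data[idx[n_train:n_train + n_val]],
            data[idx[n_train + n_val:]])

# Toy dataset with 100 unique rows
data = np.arange(100 * 4, dtype=float).reshape(100, 4)
train, val, test = split_dataset(data)
print(train.shape, val.shape, test.shape)  # (70, 4) (15, 4) (15, 4)
```

For time-series data you would replace the shuffle with a temporal cut (earliest rows for training, latest for testing), since a random permutation would let the model peek into the future.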

References: Deep Learning with Python by François Chollet — https://www.goodreads.com/book/show/33986067-deep-learning-with-python
