designing_good_validation.md

Intro

One common error in Kaggle competitions is to rely on the public leaderboard ranking. In fact, the private leaderboard ranking can be quite different (since test sets are chosen randomly, and in code competitions the actual test set is not even given). This shows the importance of validation in data science competitions.
Monitoring your performance when modeling and recognizing when overfitting happens is a key competency not only in data science competitions but in all data science projects.

Train & test datasets in Kaggle

In order for a competition to work properly, training data and test data should be from the same distribution.
- Moreover, the private and public parts of the test data should resemble each other in terms of distribution.
Even if the training and test data are apparently from the same distribution, the lack of sufficient examples in either set could make it difficult to obtain aligned results between the training data and the public and private test data.
The public test data should be regarded as a holdout test in a data science project: to be used only for final validation. Hence, it should not be queried much in order to avoid what is called adaptive overfitting, which implies a model that works well on a specific test set but underperforms on others. Reference: The dangers of overfitting

Shake-ups

The above considerations are the main reason for the shake-ups in the rankings, which are commonly attributed to the differences between the training and test sets or between the private and public parts of the test data.
A general shake-up is calculated like this: mean(abs(private_rank - public_rank)/number_of_teams). See more here and here.
There is little adaptive overfitting; in other words, public standings usually do hold in the unveiled private leaderboard.
Most shake-ups are due to random fluctuations and overcrowded rankings where competitors are too near to each other, and any slight change in the performance in the private test sets causes major changes in the rankings.
Shake-ups happen when the training set is very small or the training data is not independent and identically distributed (i.i.d.). Paper: A Meta-Analysis of Overfitting in Machine Learning
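
The general shake-up formula given above can be sketched in a few lines. This is only an illustration: the function name `shake_up` and the rank values are hypothetical and made up for the example, not taken from any real competition.

```python
import numpy as np

def shake_up(public_rank, private_rank):
    """Mean absolute rank change, normalized by the number of teams."""
    public_rank = np.asarray(public_rank, dtype=float)
    private_rank = np.asarray(private_rank, dtype=float)
    n_teams = len(public_rank)
    return float(np.mean(np.abs(private_rank - public_rank) / n_teams))

# Hypothetical ranks for five teams (illustrative values only).
public = [1, 2, 3, 4, 5]
private = [2, 1, 3, 5, 4]
score = shake_up(public, private)
print(f"shake-up: {score:.2f}")
```

A value near 0 means the private ranking closely follows the public one; larger values indicate more reshuffling.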

Suggested strategy

Here, we suggest a strategy that is a bit more sophisticated than simply following what happens on the public leaderboard:

Always build reliable cross-validation systems for local scoring.
Always try to control non-i.i.d distributions using the best validation scheme dictated by the situation. Unless clearly stated in the description of the competition, it is not an easy task to spot non-i.i.d. distributions, but you can get hints from discussion or by experimenting using stratified validation schemes (when stratifying according to a certain feature, the results improve decisively, for instance).
Correlate local scoring with the public leaderboard in order to figure out whether or not they go in the same direction.
Test using adversarial validation, revealing whether or not the test distribution is similar to the training data.
Make your solutions more robust using ensembling, especially if you are working with small datasets.

The importance of validation in competitions

If you think about a competition carefully, you can imagine it as a huge system of experiments. Whoever can create the most systematic and efficient way to run these experiments wins.

Systematic experimentation

The key to successful participation resides in the number of experiments you conduct and the way you run all of them.
Failing fast and learning from each failure is an important factor in a competition.
Having a proper validation strategy is the great discriminator between successful Kaggle competitors and those who just overfit the leaderboard and end up in lower-than-expected rankings after the competition. Validation helps you experiment in the right direction.
Though the temptation to submit your top public leaderboard models may be high, ***always consider your own validation scores.***
- For your final submissions, depending on the situation and whether or not you trust the leaderboard, choose your best model based on the leaderboard and your best based on your local validation results. If you don’t trust the leaderboard (especially when the training sample is small or the examples are non-i.i.d.), submit models that have two of the best validation scores, picking two very different models or ensembles. In this way, you will reduce the risk of choosing solutions that won’t perform on the private test set.

Bias and variance

A good validation system helps you with metrics that are more reliable than the error measures you get from your training set.
- In fact, metrics obtained on the training set are affected by the capacity and complexity of each model.
- You can think of the capacity of a model as its memory that it can use to learn from data.
Models can be reduced to mathematical functions that map an input (the observed data) to a result (the predictions).
- If the mathematical function of a model is not complex or expressive enough to capture the complexity of the problem you are trying to solve, we talk of bias, because your predictions will be limited (“biased”) by the limits of the model itself.
- If the mathematical function at the core of a model is too complex for the problem at hand, we have a variance problem, because the model will record more details and noise in the training data than needed and its predictions will be deeply influenced by them and become erratic.
- Note: Nowadays, given the advances in machine learning and the available computation resources, the problem is almost always due to variance.
  - The reason is that deep neural networks and gradient boosting, the most commonly used solutions, often have a mathematical expressiveness that exceeds what most of the problems you will face need in order to be solved.
- The process of learning elements of the training set that have no generalization value is commonly called overfitting.
  - The core purpose of validation is to explicitly define a score or loss value that separates the generalizable part of that value from that due to overfitting the training set characteristics. This is the validation loss.

You can hear about overfitting at various levels:
- At the level of the training data, when you use a model that is too complex for the problem.
- At the level of the validation set itself, when you tune your model too much with respect to a specific validation set.
- At the level of the public leaderboard, when your results are far from what you would expect from your training.
- At the level of the private leaderboard, when in spite of the good results on the public leaderboard, your private scores are disappointing.

Trying different splitting strategies

To summarize the strategies for validating your model and measuring its performance correctly, you have a couple of choices:

The first choice is to work with a holdout system, incurring the risk of not properly choosing a representative sample of the data or overfitting to your validation holdout.
The second option is to use a probabilistic approach and rely on a series of samples to draw your conclusions on your models.
- Among the probabilistic approaches, you have cross-validation, ***leave-one-out (LOO)***, and bootstrap.
- Among cross-validation strategies, there are different nuances depending on the sampling strategies you take based on the characteristics of your data:
  - Simple random sampling
  - Stratified sampling
  - Sampling by groups
  - Time sampling
Sampling is at the root of statistics and it is not an exact procedure because, based on your sampling method, your available data, and the randomness of picking up certain cases as part of your sample, you will experience a certain degree of error.
- For instance, if you rely on a biased sample, your evaluation metric may be estimated incorrectly (over- or under-estimated).
The other aspect that all these strategies have in common is that they are partitions, which divide cases in an exclusive way as either part of the training or part of the validation.

The basic train-test split

In this strategy, you sample a portion of your training set (also known as the holdout) and you use it as a test set for all the models that you train using the remaining part of the data.
Great advantage: It is very simple. In Scikit-learn, you can use the train_test_split function.

Notes on using train_test_split:

When you have large amounts of data, you can expect that the test data you extract is similar to (representative of) the original distribution on the entire dataset.
- However, since the extraction process is based on randomness, you always have the chance of extracting a non-representative sample.
- In particular, the chance increases if the training sample you start from is small.
- Comparing the extracted holdout partition using adversarial validation can help you to make sure you are evaluating your efforts in a correct way.
In addition, to ensure that your test sampling is representative, especially with regard to how the training data relates to the target variable, you can use stratification, which ensures that the proportions of certain features are respected in the sampled data.
- You can use the stratify parameter in the train_test_split function and provide an array containing the class distribution to preserve.
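
As a minimal sketch of the stratified holdout described above, the following uses synthetic data with a 10% positive class (the data, sizes, and variable names are assumptions made for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced data: 10% positives (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 100 + [0] * 900)

# stratify=y keeps the class proportions equal in both partitions.
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("positive rate, full data:", y.mean())
print("positive rate, holdout:  ", y_hold.mean())
```

Without `stratify`, the holdout positive rate would fluctuate around 10% from one random seed to another.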

Note: Even if you have a representative holdout available, sometimes a simple train-test split is not enough for ensuring a correct tracking of your efforts in a competition.

In fact, as you keep checking on this test set, you may drive your choices to some kind of adaptive overfitting (in other words, erroneously treating noise in the holdout set as signal), as happens when you frequently evaluate on the public leaderboard.
For this reason, a probabilistic evaluation, though more computationally expensive, is more suited for a competition.

Probabilistic evaluation methods

Probabilistic evaluation of the performance of a learning model is based on the statistical properties of a sample from a distribution.
- By sampling, you create a smaller set of your original data that is expected to have the same characteristics.
By training and testing your model on this sampled data and repeating this procedure a large number of times, you are basically creating a statistical estimator measuring the performance of your model.
Every sample may have some error in it; i.e. it may not be fully representative of the true distribution of the original data.
- However, as you sample more, the mean of your estimators on these multiple samples will converge to the true mean of the measure you’re estimating.
- This follows from the Law of Large Numbers.

k-fold cross validation

The most used probabilistic validation method. Paper: Cross-validation: what does it estimate and how well does it do it?
k-fold can be used to compare predictive models as well as to select hyperparameters.
There are quite a few different variations of k-fold cross-validation, but the simplest one is the KFold in Scikit-learn.
- Split the training data into $k$ partitions. For $k$ iterations, one of the $k$ partitions is taken as the test set while the others are used for training. The $k$ validation scores are then averaged (this is the k-fold validation score), which gives you the estimated average model performance.
- The standard deviation of the scores will inform you about the uncertainty of the estimate.
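
A minimal sketch of this procedure with Scikit-learn's KFold follows. The synthetic dataset, the logistic regression model, and the ROC-AUC metric are arbitrary choices made for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = []
for train_idx, valid_idx in kf.split(X):
    # Train on k-1 folds, score on the held-out fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict_proba(X[valid_idx])[:, 1]
    scores.append(roc_auc_score(y[valid_idx], preds))

# The mean is the k-fold score; the standard deviation hints at its uncertainty.
print(f"CV AUC: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```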

Note: One important aspect of the k-fold CV score is that it estimates the average score of a model trained on the same quantity of data as $k-1$ folds. If, afterward, you train your model on all your data, the previous validation estimate no longer strictly holds.
- As $k$ approaches the number $n$ of examples, you get an increasingly accurate estimate of the model trained on the full training set. Yet, due to the growing correlation between the estimates you obtain from each fold, you lose the probabilistic properties of the validation estimate.
Note: When you reach $k=n$, you have the LOO validation method, which is useful when you have only a few cases available.
- The method is mostly an unbiased fitting measure since it uses almost all the available data for training and just one example for testing. Yet, it is not a good estimate of the expected performance on unseen data. The repeated scores over the dataset are highly correlated.

The correct k number of partitions

The smaller the $k$ (the minimum is 2), the smaller the training portion made of $k-1$ folds will be, and consequently the more bias in learning there will be: a model validated with a smaller $k$ will perform worse than the same model validated with a larger $k$, because it is trained on less data.
The higher the $k$, the more data is used for training, yet the more correlated your validation estimates become: you lose the interesting properties of k-fold cross-validation in estimating the performance on unseen data.
Note: Commonly, $k$ is set to 5, 7, or 10, more seldom to 20 folds. $k=5$ or $k=10$ are good choices for a competition.
- Since $k=10$ uses more data for training (90% of available data), it’s more suitable for figuring out the performance of your model when you re-train on the full dataset.

Two important considerations

The choice of $k$ should reflect your goals:
- If your purpose is performance estimation, you need models with low bias estimates (i.e. no systematic distortion of estimates). You can achieve this by using a higher number of folds, usually between 10 and 20.
- If your aim is parameter tuning, you need a mix of bias and variance, so a medium $k$ (between 5 and 7) would be a good choice.
- If your purpose is just to apply variable selection and simplify your dataset, you need models with low variance estimates. Hence, a lower number of folds will suffice (between 3 and 5).
- Note: When the size of the available data is quite large, you can safely stay on the lower side of the suggested bands.
If you are just aiming for performance estimation, consider that the more folds you use, the fewer cases you will have in your validation set, so the more the estimates of each fold will be correlated.
- Beyond a certain point, increasing $k$ renders your cross-validation estimates less predictive of unseen test sets and more representative of an estimate of how well-performing your model is on your training set.
- This also means that, with more folds, you can get the perfect out-of-fold prediction for stacking purposes.

k-fold CV to produce your predictions: Many Kaggle competitors use the models built during cross-validation to provide a series of predictions on the test set that, averaged, will provide them with the solution.

k-fold variations

Since it is based on random sampling, k-fold can provide unsuitable splits when:

You have to preserve the proportion of small classes, both at the target and feature levels. This is typical when your target is highly imbalanced. Examples: spam datasets and credit risk datasets.
You have to preserve the distribution of a numeric variable, both at the target and feature levels. This is typical of regression problems where the distribution is quite skewed or you have heavy, long tails. Example: house price prediction, where you have a consistent small proportion of houses on sale that will cost much more than the average house.
Your cases are not i.i.d., in particular when dealing with time series forecasting.

Stratified k-fold

The sampling is done in a controlled way that preserves the distribution you want to preserve.
Use StratifiedKFold from Scikit-learn, using a stratification variable (usually your target).
- To stratify on a numeric variable, first discretize it, for example with pandas.cut or KBinsDiscretizer (from Scikit-learn).
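
As a small illustration of what StratifiedKFold preserves, the sketch below uses a synthetic target with 5% positives (the data is made up for the example) and checks the positive rate inside each validation fold:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced target: 5% positives (illustrative only).
y = np.array([1] * 50 + [0] * 950)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_rates = [y[valid_idx].mean() for _, valid_idx in skf.split(X, y)]

# Each validation fold keeps the same share of positives as the full data.
print(fold_rates)
```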

k-fold stratification based on multiple variables

You can find a solution in the Scikit-multilearn package. In particular, the IterativeStratification class helps you to control the order (the number of combined proportions of multiple variables) that you want to preserve.
- Paper: On the Stratification of Multi-Label Data
- Paper: A Network Perspective on Stratification of Multi-Label Data.

k-fold stratification for regression

You can actually make good use of stratification even when your problem is not a classification, but a regression.
You have to use a discrete proxy for your target instead of your continuous target.
The simplest way is to use the pandas cut function and divide your target into a large enough number of bins.

import pandas as pd
y_proxy = pd.cut(y_train, bins=10, labels=False)

In order to determine the number of bins, you could use Sturges’ rule based on the number of examples available, for example:

import numpy as np
bins = int(np.floor(1 + np.log2(len(X_train))))
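
Putting the two snippets together, a possible end-to-end sketch looks like the following. The synthetic `X_train` and `y_train` are assumptions made for illustration. Note that equal-width bins at the tails may contain very few cases, in which case Scikit-learn emits a warning about sparsely populated classes.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 4))
y_train = rng.normal(loc=100.0, scale=15.0, size=1000)

# Sturges' rule for the number of bins, then a discrete proxy of the target.
bins = int(np.floor(1 + np.log2(len(X_train))))
y_proxy = pd.cut(y_train, bins=bins, labels=False)

# Stratify on the proxy, but train and evaluate on the continuous target.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_means = [
    y_train[valid_idx].mean() for _, valid_idx in skf.split(X_train, y_proxy)
]
print(bins, np.round(fold_means, 2))
```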

Another alternative approach is to focus on the distributions of the features in the training set and aim to reproduce them.
- This requires use of cluster analysis.
- The predicted clusters are used as strata.
- Example $\rightarrow$ first PCA is used to remove correlations, then a $k$ -means clustering is performed.
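
The cluster-based alternative can be sketched as follows. The number of components, the number of clusters, and the synthetic features are arbitrary assumptions; in practice you would choose them by inspecting your data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))

# Decorrelate the features with PCA, then cluster in the reduced space.
X_scaled = StandardScaler().fit_transform(X)
components = PCA(n_components=5, random_state=0).fit_transform(X_scaled)
clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(components)

# Use the cluster labels as strata for the folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(skf.split(X, clusters))
print(len(folds), np.bincount(clusters))
```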

Non-i.i.d. data

Non-i.i.d. data can occur when there is some grouping between the examples.
The problem with non-i.i.d. data is that the features and targets are correlated between the examples.
The solution here is to use GroupKFold by providing the grouping variable. It ensures that the same group never appears in both the training and validation datasets.
Note: Discovering groupings in the data is not an easy task and requires some effort to identify.
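
A minimal sketch of GroupKFold follows. The `groups` array is a hypothetical grouping variable (for instance, one id per customer) generated at random for illustration:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = rng.integers(0, 2, size=n)
groups = rng.integers(0, 20, size=n)  # hypothetical group ids

gkf = GroupKFold(n_splits=5)
overlaps = []
for train_idx, valid_idx in gkf.split(X, y, groups=groups):
    # Groups present in both partitions (expected to be empty).
    overlaps.append(set(groups[train_idx]) & set(groups[valid_idx]))

print(overlaps)
```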
Note: Time series data present the same non-i.i.d. problem due to auto-correlation. In time series, you must split based on time.

For a more complex approach, you can use Scikit-learn’s TimeSeriesSplit method.

Growing training set and a moving validation set

Training (fixed lookback) and validation splits are moving over time

Note: Going by a fixed lookback helps to provide a fairer evaluation of time series models since you are always counting on the same training set size.
Note: Finally, remember that TimeSeriesSplit can be set to keep a pre-defined gap between your training and test time.
- This is extremely useful when you are told that the test set is a certain amount of time in the future (for instance, a month after the training data) and you want to test if your model is robust enough to predict that far into the future.
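
The two windowing schemes and the gap can be sketched with TimeSeriesSplit as follows. The series length, window sizes, and gap are arbitrary values chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # 100 time-ordered observations

# Growing training set with a moving validation set.
expanding = TimeSeriesSplit(n_splits=4, test_size=10)
# Fixed lookback of 30 observations, with a gap of 5 before validation.
sliding = TimeSeriesSplit(n_splits=4, test_size=10, max_train_size=30, gap=5)

expanding_splits = list(expanding.split(X))
sliding_splits = list(sliding.split(X))

for name, splits in [("expanding", expanding_splits), ("sliding", sliding_splits)]:
    for train_idx, valid_idx in splits:
        print(name, "train", train_idx[0], "-", train_idx[-1],
              "valid", valid_idx[0], "-", valid_idx[-1])
```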

Nested cross-validation

Sometimes (for instance, when tuning hyperparameters) you need to test your model’s intermediate performance, not only its final performance.
In this case, you have to distinguish between a validation set, which is used to evaluate the performance of various models and hyperparameters, and a test set, which will help you to estimate the final performance of the model.
- If you are using a train-test split, this is achieved by splitting the test part into two new parts. The usual split is 70/20/10 for training, validation, and testing.
Nested CV is cross-validation based on the split of another cross-validation.
- Essentially, you run your usual cross-validation (external), but when you have to evaluate different models or different parameters, you run cross-validation based on the fold split (internal).

Note: There are a couple of problems with this approach:
- A reduced training set, since you first split by cross-validation, and then you split again.
- More importantly, it requires a huge amount of model building: if you run two nested 10-fold cross-validations, you’ll need to run 100 models.
Especially for this last reason, some Kagglers tend to ignore nested cross-validation and risk some adaptive overfitting by using the same cross-validation for both model/parameter search and performance evaluation, or using a fixed test sample for the final evaluation.
- Having said that, remember that using nested cross-validation, whenever possible, can provide you with a solution that overfits less and could make the difference in certain competitions.
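
One common way to express nested cross-validation in Scikit-learn is to wrap a hyperparameter search (the internal loop) inside cross_val_score (the external loop). The sketch below is only one possible arrangement; the model, the parameter grid, and the fold counts are assumptions chosen to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)  # parameter search
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)  # performance estimate

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="roc_auc",
)

# Each outer fold re-runs the whole inner search on its own training part.
nested_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="roc_auc")
print(nested_scores.mean(), nested_scores.std())
```

With 4 candidate values, 3 inner folds, and 5 outer folds, this already fits 60 models during the searches plus 5 refits, which illustrates the cost mentioned above.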

Producing out-of-fold predictions (OOF)

An interesting application of CV (besides model evaluation) is producing test predictions and out-of-fold predictions.
In fact, as you do CV, you can:
- Predict on the test set: The average of all predictions is often more effective than re-training the same model on all data. This is an ensembling technique related to blending.
- Predict on the validation set: In the end, you will have predictions for the entire training set and can re-order them in the same order as the original training data. These predictions are commonly referred to as out-of-fold (OOF) predictions and they can be extremely useful.
The first use of OOF predictions is to estimate your performance, since you can compute your evaluation metric directly on the OOF predictions.
- The performance obtained is different from the cross-validated estimates (based on sampling). It doesn’t have the same probabilistic characteristics, so it is not a valid way to measure generalization performance, but it can inform you about the performance of your model on the specific set you are training on.
A second use of OOF predictions is to produce a plot and visualize predictions against the ground truth values (or predictions from other models).
A third use of OOF predictions is to create meta-features or meta-predictors.
Note: Since every prediction in your OOF predictions has been generated by a model trained on different data, these predictions are unbiased and you can use them without any fear of overfitting.
Generating OOF predictions can be done in two ways:
- By coding a procedure that stores the validation predictions into a prediction vector, taking care to arrange them in the same index position as the examples in the training data.
- By using the Scikit-learn function cross_val_predict, which will automatically generate the OOF predictions for you.
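
Both ways can be sketched side by side. The manual loop also averages the fold models' predictions on a test set, as described earlier. The synthetic data, the split into train and test, and the model are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_predict

X_all, y_all = make_classification(n_samples=550, n_features=10, random_state=0)
X, y = X_all[:500], y_all[:500]
X_test = X_all[500:]  # stand-in for a competition test set

kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Manual procedure: fill the OOF vector at the original index positions.
oof = np.zeros(len(X))
test_preds = np.zeros(len(X_test))
for train_idx, valid_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    oof[valid_idx] = model.predict_proba(X[valid_idx])[:, 1]
    test_preds += model.predict_proba(X_test)[:, 1] / kf.get_n_splits()

# Same OOF predictions using cross_val_predict with the same splitter.
oof_auto = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=kf, method="predict_proba"
)[:, 1]

print("OOF AUC:", roc_auc_score(y, oof))
```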

Subsampling

Another validation strategy is subsampling.
Subsampling is similar to $k$ -fold, but you do not have fixed folds; you use as many as you think are necessary (in other words, take an educated guess).
You repeatedly subsample your data, using the sampled data for training and the remaining data for validation.
By averaging the evaluation metrics of all the subsamples, you will get a validation estimate of the performances of your model.
You can use Scikit-learn’s ShuffleSplit for this sort of validation.
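
A minimal sketch of subsampling with ShuffleSplit follows. The number of repetitions, the validation share, the model, and the synthetic data are arbitrary assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ShuffleSplit

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# 20 independent random splits, each holding out 20% for validation.
ss = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
rmse = []
for train_idx, valid_idx in ss.split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    preds = model.predict(X[valid_idx])
    rmse.append(np.sqrt(mean_squared_error(y[valid_idx], preds)))

print(f"RMSE: {np.mean(rmse):.3f} +/- {np.std(rmse):.3f}")
```

Unlike k-fold, the validation sets of different repetitions can overlap, and some cases may never be used for validation.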

The bootstrap

Another option is to try the bootstrap, which is a statistical method for estimating the error distribution of an estimate. For the same reason, it can be used for performance estimation.
The bootstrap requires you to draw a sample, with replacement, that is the same size as the available data.
At this point, you can use the bootstrap in two ways:
1. As in statistics, you can bootstrap multiple times, train your model on the samples, and compute your evaluation metric on the training data itself. The average of the bootstraps will provide your final evaluation.
2. Otherwise, as in subsampling, you can use the bootstrapped sample for your training and what is left not sampled from the data as your test set.
Note: This method is more suitable for its statistical applications and much less useful for machine learning, mainly because most ML models tend to overfit.
- For this reason, Efron and Tibshirani, Improvements on cross-validation: the 632+ bootstrap method, proposed the 632+ estimator as a final validation metric.

$err_{0.632+} = (1-w) \cdot err_{\text{fit}} + w \cdot err_{\text{bootstrap}}$

where

$w = \frac{0.632}{1 - 0.368R}$

$R = \frac{err_{\text{bootstrap}} - err_{\text{fit}}}{\gamma - err_{\text{fit}}}$

$err_{\text{fit}}$ is your metric computed on the training data,
$err_{\text{bootstrap}}$ is the metric computed on the bootstrapped data.
$\gamma$ is the no-information error rate, estimated by evaluating the prediction model on all possible combinations of targets and predictors.
- Calculating $\gamma$ is intractable, see more.

Given the limits and intractability of using the bootstrap as in classical statistics for machine learning applications, you can instead use the second method, getting your evaluation from the examples left not sampled by the bootstrap.

Note: As with subsampling, this method requires building many more models and testing them than for $k$ -fold CV.
There used to be an implementation of the bootstrap method for CV in Scikit-learn, but it was later removed. Below is another implementation:

import random

def Bootstrap(n, n_iter=3, random_state=None):
	"""
	Random sampling with replacement cross-validation generator.
	For each iter a sample bootstrap of the indexes [0, n) is 
	generated and the functions returns the obtained sample 
	and a list of all the excluded indexes.
	"""
	if random_state is not None:
		random.seed(random_state)
	for j in range(n_iter):
		bs = [random.randint(0, n-1) for i in range(n)]
		out_bs = list({i for i in range(n)} - set(bs))
		yield bs, out_bs
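
The second way of using the bootstrap (training on the bootstrapped sample and evaluating on the cases left out) can be sketched as follows. This is a self-contained NumPy variant rather than a call to the generator above; the data, the model, and the number of repetitions are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
rng = np.random.default_rng(0)
n = len(X)

oob_scores = []
for _ in range(30):
    # Sample n indexes with replacement; the unsampled ones form the test set.
    bs = rng.integers(0, n, size=n)
    out_bs = np.setdiff1d(np.arange(n), bs)
    model = LogisticRegression(max_iter=1000).fit(X[bs], y[bs])
    oob_scores.append(accuracy_score(y[out_bs], model.predict(X[out_bs])))

print(f"accuracy: {np.mean(oob_scores):.3f} +/- {np.std(oob_scores):.3f}")
```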

***In conclusion, the bootstrap is indeed an alternative to cross-validation. It is certainly more widely used in statistics and finance. In machine learning, the golden rule is to use the k-fold cross-validation approach. However, we suggest not forgetting about the bootstrap in all those situations where, due to outliers or a few examples that are too heterogeneous, you have a large standard error of the evaluation metric in cross-validation. In these cases, the bootstrap will prove much more useful in validating your models.***

Tuning your model validation system

As a golden rule, be guided in devising your validation strategy by the idea that you have to replicate the same approach used by the organizers of the competition to split the data into training, private, and public test sets.
Ask yourself how the organizers have arranged those splits.
- Did they draw a random sample?
- Did they try to preserve some specific distribution in the data?
- Are the test sets actually drawn from the same distribution as the training data?
If you focus on this idea from the beginning, you will have more of a chance of finding out the best validation strategy, which will help you rank more highly in the competition.
Note: These are not the questions you would ask yourself in a real-world project. Contrary to real-world projects, competitions have a much narrower focus.
Since this is a trial-and-error process, apply the following two consistency checks in order to figure out if you are on the right path:
1. First, you have to check if your local tests are consistent, that is, that the single cross-validation fold errors are not so different from each other or, when you opt for a simple train-test split, that the same results are reproducible using different train-test splits.
  - If you’re failing this check, you have a few options depending on the following possible origins of the problem:
    - You don’t have much training data.
    - The data is too diverse and every training partition is very different from every other (for instance, if you have too many high cardinality features, that is, features with many levels - like zip codes - or if you have multivariate outliers).
  - In both cases, the point is you lack data.
  - In this case, unless you find out that moving to a simpler algorithm works on the evaluation metric (in which case trading variance for bias may worsen your model’s performance, but not always), your best choice is to use an extensive validation approach. This can be implemented by:
    - Using larger $k$ values (thus approaching LOO where $k = n$ ). Your validation results will be less about the capability of your model to perform on unseen data, but by using larger training portions, you will have the advantage of more stable evaluations.
    - Averaging the results of multiple $k$ -fold validations (based on different data partitions picked by different random seed initializations).
    - Using repetitive bootstrapping.
  - Keep in mind that when you find unstable local validation results, you won’t be the only one to suffer from the problem.
  - Usually, this is a common problem due to the data’s origin and characteristics.
  - By keeping tuned in to the discussion forums, you may get hints at possible solutions.
    - For instance, a good solution for high cardinality features is target encoding; stratification can help with outliers; and so on.
2. Then, you have to check if your local validation error is consistent with the results on the public leaderboard.
  - Here, your local cross-validation is consistent but you find that it doesn’t hold on the leaderboard.
  - In order to realize this problem exists, you have to keep diligent note of all your experiments, validation test types, random seeds used, and leaderboard results if you submitted the resulting predictions.
    - In this way, you can draw a simple scatterplot and try fitting a linear regression or, even simpler, compute a correlation between your local results and the associated public leaderboard scores.
  - It costs some time and patience to annotate and analyze all of these, but it is the most important meta-analysis of your competition performances that you can keep track of.
  - If your local scores are correlated with the leaderboard but systematically higher or lower, you have a strong signal that something is missing from your validation strategy. You can still work on improving your model, but the improvements may not be reflected proportionally on the leaderboard.
    - However, systematic differences are always a red flag, implying something is different between what you are doing and what the organizers have arranged for testing the model.
    - An even worse scenario occurs when your local cross-validation scores do not correlate at all with the leaderboard feedback. This is really a red flag.
    - When you realize this is the case, you should immediately run a series of tests and investigations in order to figure out why, because, regardless of whether it is a common problem or not, the situation poses a serious threat to your final rankings. There are a few possibilities in such a scenario:
      - You figure out that the test set is drawn from a different distribution to the training set. The adversarial validation test is the method that can enlighten you in such a situation.
      - The data is non-i.i.d. but this is not explicit. For instance, in The Nature Conservancy Fisheries Monitoring competition, images in the training set were taken from similar situations (fishing boats). You had to figure out by yourself how to arrange them in order to avoid the model learning to identify the context of the images rather than the target (see, for instance, this work by Anokas).
      - The multivariate distribution of the features is the same, but some groups are distributed differently in the test set. If you can figure out the differences, you can set your training set and your validation accordingly and gain an edge. You need to probe the public leaderboard to work this out. Probing the leaderboard is the act of making specifically devised submissions in order to get insights about the composition of the public test set. It works particularly well if the private test set is similar to the public one. There is no general method for probing; you have to devise your own. LB probing example 1, LB probing example 2, LB probing example 3, LB probing example 4.
      - The test data is drifted or trended, which is usually the case in time series predictions. Again, you need to probe the public leaderboard to get some insight about some possible post-processing that could help your score, for instance, applying a multiplier to your predictions, thus mimicking a decreasing or increasing trend in the test data.
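
The second consistency check (comparing logged local scores with public leaderboard scores) can be sketched with a correlation and a simple linear fit. The score values below are placeholders invented purely to make the snippet runnable; replace them with your own experiment log.

```python
import numpy as np

# Placeholder experiment log (made-up values, for illustration only).
local_cv = np.array([0.801, 0.812, 0.815, 0.823, 0.830])
public_lb = np.array([0.795, 0.804, 0.811, 0.815, 0.826])

# Pearson correlation between local and leaderboard scores.
r = np.corrcoef(local_cv, public_lb)[0, 1]

# Linear fit: a slope near 1 with a nonzero intercept suggests a systematic offset.
slope, intercept = np.polyfit(local_cv, public_lb, deg=1)

print(f"correlation: {r:.3f}, slope: {slope:.3f}, intercept: {intercept:.3f}")
```

A correlation close to zero corresponds to the worst scenario described above, where local validation gives no information about the leaderboard.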

Using adversarial validation

As we have discussed, cross-validation allows you to test your model’s ability to generalize to unseen datasets coming from the same distribution as your training data. In reality, the test data does not always come from that distribution.
Your validation results can be misleading even when the test set is only slightly different from the training set on which you have based your model.
Hence, it is not enough to avoid overfitting to the leaderboard, but, in the first place, it is also advisable to find out if your test data is comparable to the training data.
Adversarial validation has been developed just for this purpose. It’s a technique allowing you to easily estimate the degree of difference between your training and test data.
The idea is simple:
- Take your training data, remove the target, assemble your training data together with your test data, and create a new binary classification target where the positive label is assigned to the test data.
- At this point, run a machine learning classifier and evaluate it with the ROC-AUC metric.
- If your ROC-AUC is around 0.5, it means that the training and test data are not easily distinguishable and are apparently from the same distribution.
- Example notebook
Note: Since there might be missing values, you need to do some data processing before running the classifier. It’s recommended to use the random forest classifier because:
- It doesn’t output true probabilities but its results are intended as simply ordinal, which is a perfect fit for an ROC-AUC score.
- The random forest is a flexible algorithm based on decision trees that can do feature selection by itself and operate on different types of features without any pre-processing beyond rendering all the data numeric. It is also quite robust to overfitting and you don’t have to think too much about fixing its hyperparameters.
- You don’t need much data processing because of its tree-based nature. For missing data, you can simply replace the values with an improbable negative value such as -999, and you can deal with string variables by converting their strings into numbers (for instance, using the Scikit-learn label encoder, sklearn.preprocessing.LabelEncoder). As a solution, it performs less well than one-hot encoding, but it is very speedy and it will work properly for the problem.
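The light pre-processing described above can be sketched as follows. The data frame and the helper `prepare_for_adversarial` are hypothetical names introduced for illustration; treating missing strings as their own category is an assumption of this sketch, not something the notes prescribe.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame standing in for the combined train/test features.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Rome", "Paris", None, "Rome"],
})

def prepare_for_adversarial(frame, missing_value=-999):
    """Return an all-numeric copy: strings label-encoded, NaNs replaced."""
    out = frame.copy()
    for col in out.columns:
        if not pd.api.types.is_numeric_dtype(out[col]):
            # Assumption: missing strings become their own category.
            filled = out[col].fillna("missing").astype(str)
            out[col] = LabelEncoder().fit_transform(filled)
    # Remaining NaNs are in numeric columns; use an improbable value.
    return out.fillna(missing_value)

prepared = prepare_for_adversarial(df)
print(prepared)
```

Fitting the encoder on the combined training and test data is acceptable here because the only goal is to tell the two sets apart, not to predict the competition target.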
Note: Although building a classification model is the most direct way to adversarially validate, you can also use other approaches.
- One approach is to map both training and test data into a lower-dimensional space.
- Example notebook
- The advantage here is that you can graphically represent data using methods such as t-SNE or PCA.
  - UMAP can offer a faster low-dimensional solution with clear and distinct data clusters.
  - Variational autoencoders (VAEs) can deal with non-linear reduction and offer a more useful representation than PCA, though they are more complicated to set up.
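A minimal sketch of the lower-dimensional approach, using PCA on synthetic data. The arrays `train_X` and `test_X` are made up, and the sketch assumes the features are already on comparable scales (real data would usually need scaling first).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-ins: the "test" sample is shifted in the first feature.
train_X = rng.normal(0.0, 1.0, size=(500, 10))
test_X = rng.normal(0.0, 1.0, size=(500, 10))
test_X[:, 0] += 3.0

combined = np.vstack([train_X, test_X])
is_test = np.r_[np.zeros(len(train_X), dtype=bool),
                np.ones(len(test_X), dtype=bool)]

# Project both sets together into two dimensions.
embedding = PCA(n_components=2, random_state=0).fit_transform(combined)

# Compare where the two sets land in the projected space.
train_center = embedding[~is_test].mean(axis=0)
test_center = embedding[is_test].mean(axis=0)
centroid_distance = np.linalg.norm(train_center - test_center)
print(f"distance between centroids: {centroid_distance:.2f}")
```

Plotting `embedding` as a scatter plot colored by `is_test` gives the graphical view mentioned above: two overlapping clouds suggest similar distributions, while separated clouds suggest a shift.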

Example implementation

This implementation is based on the Tabular Playground Series (January 2021) competition.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score
train = pd.read_csv("../input/tabular-playground-series-jan-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-jan-2021/test.csv")

# Data preparation is short and to the point. Since all features are numeric, 
# you won’t need any label encoding, but you do have to fill any missing values
# with a negative number (-1 usually works fine), and drop the target and also any identifiers.

train = train.fillna(-1).drop(["id", "target"], axis=1)
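test = test.fillna(-1).drop(["id"], axis=1)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames.
X = pd.concat([train, test], ignore_index=True)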
y = [0] * len(train) + [1] * len(test)

# At this point, you just need to generate `RandomForestClassifier` predictions 
# for your data using the `cross_val_predict` function, which automatically
# creates a cross-validation scheme and stores the predictions on the validation fold:

model = RandomForestClassifier()
cv_preds = cross_val_predict(model, X, y, cv=5, n_jobs=-1, method='predict_proba')

# As a result, you obtain predictions that are unbiased (they are not overfit as
# you did not predict on what you trained) and that can be used for error
# estimation. Please note that `cross_val_predict` won’t fit your instantiated
# model, so you won’t get any information from it, such as what the important
# features used by the model are. If you need such information, you just need to
# fit it first by calling `model.fit(X, y)`.

print(roc_auc_score(y_true=y, y_score=cv_preds[:,1]))

# You should obtain a value of around 0.49-0.50 (the result won’t be exactly
# reproducible unless you fix `random_state` in the model and in any shuffling
# cross-validation splitter).

Handling different distributions of training and test data

ROC-AUC scores of 0.8 or more would alert you that the test set is peculiar and quite distinguishable from the training data.
In such cases, you have a few strategies:
- Suppression
- Training on cases most similar to the test set
- Validating by mimicking the test set

Suppression

You remove the variables that most influence the adversarial classifier’s result until the training and test distributions are no longer distinguishable.
To do so, you need an iterative approach.
1. You fit your model to all your data, and then you check the importance measures (feature_importances_ from the Scikit-learn RandomForestClassifier) and the ROC-AUC fit score.
2. At this point, you remove the most important variable and run everything again.
3. Repeat steps 1 and 2 until the fitted ROC-AUC score decreases to around 0.5.
Note: The only problem with this method is that you may actually be forced to remove the majority of important variables from your data.
- Any model you then build on such variable-censored data won’t be able to predict accurately enough due to the lack of informative features.
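The iterative procedure above might be sketched like this. The function `suppress_drifting_features` and its parameters are hypothetical names, the data is synthetic, and the sketch uses a cross-validated ROC-AUC rather than the fitted score mentioned in the steps, since an in-sample score from a random forest tends to stay close to 1.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def suppress_drifting_features(X, y, auc_threshold=0.55, max_removed=None):
    """Drop the most important adversarial feature until the AUC nears 0.5."""
    X = X.copy()
    removed = []
    max_removed = X.shape[1] - 1 if max_removed is None else max_removed
    while True:
        model = RandomForestClassifier(n_estimators=100, random_state=0)
        preds = cross_val_predict(model, X, y, cv=5,
                                  method="predict_proba")[:, 1]
        auc = roc_auc_score(y, preds)
        if auc <= auc_threshold or len(removed) >= max_removed:
            return X, removed, auc
        # Fit on everything only to read the feature importances.
        model.fit(X, y)
        worst = X.columns[int(np.argmax(model.feature_importances_))]
        removed.append(worst)
        X = X.drop(columns=worst)

# Synthetic example: only feature f0 differs between training and test.
rng = np.random.default_rng(0)
cols = [f"f{i}" for i in range(5)]
train = pd.DataFrame(rng.normal(size=(300, 5)), columns=cols)
test = pd.DataFrame(rng.normal(size=(300, 5)), columns=cols)
test["f0"] += 2.0

X = pd.concat([train, test], ignore_index=True)
y = np.array([0] * len(train) + [1] * len(test))

kept, removed, final_auc = suppress_drifting_features(X, y, auc_threshold=0.6)
print("removed:", removed, "final AUC:", round(final_auc, 3))
```

The `max_removed` guard reflects the caveat in the note above: if the loop has to discard most of the features, the remaining data may be too uninformative to model.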

Train on the examples most similar to test set

In this approach, you focus on the samples you use for training instead of features.
You pick up from the training set only the samples that fit the test distribution.
- Note: Any trained model then suits the test distribution (but it won’t be generalizable to anything else), which should allow it to perform at its best on the competition problem.
The limitation of this approach is that you are cutting down the size of your dataset, and depending on the number of samples remaining, you may end up with a very biased model.
- In the previous example, picking just the adversarial predictions on the training data that exceed a probability of 0.5 and counting them results in picking only 1,495 cases (the number is so small because the test set is not very different from the training set):

# The first len(train) rows of X are the training rows.
print(np.sum(cv_preds[:len(train), 1] > 0.5))

Validating by mimicking the test set

You keep on training on all the data, but for validation purposes, you pick your examples only from the adversarial predictions on the training set that exceed a probability of 0.5 (or an even higher threshold such as 0.9).
Having a validation set tuned to the test set will allow you to pick all the possible hyperparameters and model choices that will favor a better result on the leaderboard.
- In the previous example, we can figure out that feature_19 and feature_54 appear the most different between the training/test split from the output of the following code:

model.fit(X, y)
ranks = sorted(list(zip(X.columns, model.feature_importances_)), 
               key=lambda x: x[1], reverse=True)
for feature, score in ranks:
    print(f"{feature:10} : {score:0.4f}")
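For the "validating by mimicking the test set" strategy, one possible arrangement is sketched below. It is an interpretation under stated assumptions: the data and target are synthetic, `Ridge` is just a placeholder task model, and the validation score is computed from out-of-fold predictions restricted to the training rows that the adversarial classifier considers test-like.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Hypothetical data: the test features are mildly shifted versus training.
train_X = rng.normal(0.0, 1.0, size=(400, 5))
test_X = rng.normal(0.5, 1.0, size=(400, 5))
train_y = train_X[:, 0] ** 2 + rng.normal(0.0, 0.1, size=400)  # made-up target

# Adversarial step: out-of-fold probability that a row is a test row.
X_adv = np.vstack([train_X, test_X])
y_adv = np.r_[np.zeros(len(train_X), dtype=int),
              np.ones(len(test_X), dtype=int)]
adv_model = RandomForestClassifier(n_estimators=200, random_state=0)
adv_proba = cross_val_predict(adv_model, X_adv, y_adv, cv=5,
                              method="predict_proba")[:, 1]

# Only the first len(train_X) rows are training rows.
test_like = adv_proba[: len(train_X)] > 0.5

# Out-of-fold predictions of the task model on all training rows.
oof = cross_val_predict(Ridge(), train_X, train_y, cv=5)

rmse_all = np.sqrt(mean_squared_error(train_y, oof))
rmse_test_like = np.sqrt(
    mean_squared_error(train_y[test_like], oof[test_like])
)
print(f"test-like rows: {test_like.sum()} of {len(train_X)}")
print(f"RMSE on all rows: {rmse_all:.3f}, on test-like rows: {rmse_test_like:.3f}")
```

Comparing candidate models on `rmse_test_like` rather than `rmse_all` is what tunes the choice toward the test distribution; raising the 0.5 threshold makes the validation rows more test-like but fewer.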

Concluding remarks on adversarial validation

First, using it will generally help you to perform better in competitions, but not always. Kaggle’s Code competitions, and other competitions where you cannot fully access the test set, cannot be inspected by adversarial validation.
In addition, adversarial validation can inform you about the test data as a whole, but it cannot advise you on the split between the private and the public test data, which is the cause of the most common form of public leaderboard overfitting and consequent shake-up.
Finally, adversarial validation, though a very specific method devised for competitions, has quite a few practical use cases in the real world:
- How often have you picked the wrong test set to validate your models? The method we have presented here can enlighten you about whether you are using the test data, and any validation data, in your projects properly.
- Moreover, data changes over time, and models in production may be affected by such changes and produce bad predictions if you don’t retrain them. This is called concept drift, and by using adversarial validation, you can quickly understand whether you have to retrain new models to put into production or whether you can leave the previous ones in operation.

Handling leakage

Leakage (also referred to as golden features) involves information in the training phase that won’t be available at prediction time.
The presence of such information (leakage) will make your model over-perform in training and testing, allowing you to rank highly in the competition, but will render unusable or at best suboptimal any solution based on it from the sponsor’s point of view.
We can define leakage as ***“when information concerning the ground truth is artificially and unintentionally introduced within the training feature data, or training metadata”***.
Leakage is often found in Kaggle competitions.
Note: Don’t confuse data leakage with a leaky validation strategy:
- In a leaky validation strategy, the problem is that you have arranged your validation in a way that favors better validation scores because some information leaks between the training and validation data. It has nothing to do with the competition itself, but it relates to how you are handling your validation.
- It occurs if you fit any pre-processing that modifies your data (normalization, dimensionality reduction, missing value imputation) before separating training and validation or test data.
- In order to prevent leaky validation, if you are using Scikit-learn to manipulate and process your data, you absolutely have to exclude your validation data from any fitting operation.
  - Fitting operations tend to create leakage if applied to any data you use for validation.
  - The best way to avoid this is to use Scikit-learn pipelines.
- Data leakage, instead, is not strictly related to validation operations, though it affects them deeply.
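The pipeline advice in the note above can be sketched as follows. The dataset is synthetic and the particular steps (median imputation, standardization, logistic regression) are only an example; the point is that each step is refitted on the training part of every fold, so the validation part never influences the fitting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with some missing values injected.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Imputer and scaler live inside the pipeline, so cross_val_score fits
# them on the training folds only and merely applies them to the
# validation fold.
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(pipeline, X, y, cv=5, scoring="roc_auc")
print(f"mean ROC-AUC: {scores.mean():.3f}")
```

The leaky alternative would be to call `fit_transform` on the imputer or scaler over the whole of `X` before cross-validating, which lets statistics from the validation rows shape the transformation.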
Generally speaking, leakage can originate at a feature or example level.

Feature leakage

This is by far the most common leakage.
It can be caused by the existence of a proxy for the target, or by a feature that is posterior to the target itself.
A target proxy could be anything derived from processing the label itself or from the test split process.
- For instance, when defining identifiers, specific identifiers (a numeration arc, for instance) may be associated with certain target responses, making it easier for a model to guess if properly fed with the information processed in the right way.
Leakage due to competition organizer’s mistake: A more subtle way in which data processing can cause leakage is when the competition organizers have processed the training and test sets together before splitting them. Typical sources include:
- Mishandled data preparation by organizers, especially when they operate on a combination of training and test data. Example leakage 1: organizers initially used features with aggregated historical data that leaked future information.
- Row order when it is connected to a time index or to specific data groups. Example leakage 2: the order of records in a feature hinted at proxy information, the location, which was not present in the data and which was very predictive.
- Column order when it is connected to a time index (you get hints by using the columns as rows).
- Feature duplication in consecutive rows, because it can hint at examples with correlated responses. Example leakage 4.
- Image metadata. Example leakage 5.
- Hashes or other easily crackable anonymization practices of encodings and identifiers.

Posterior information

The trouble with posterior information originates from the way we deal with information when we do not consider the effects of time and of the sequence of cause and effect that spans across time.
- Since we are looking back at the past, we often forget that certain variables that make sense at the present moment do not have value in the past.
  - For instance, if you have to calculate a credit score for a loan to a new company, knowing that payments of the borrowed money are often late is a great indicator of the lower reliability and higher risk represented by the debtor, but you cannot know this before you have lent out the money.
- This is also a problem that you will commonly find when analyzing company databases in your projects: your query data will represent present situations, not past ones.
- Reconstructing past information can also be a difficult task if you cannot specify that you wish to retrieve only the information that was present at a certain time. For this reason, great effort has to be spent on finding these leaking features and excluding or adjusting them before building any model.
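One way to retrieve only the information that was present at a certain time is to filter records by their timestamp before aggregating. The sketch below uses the loan example from above with made-up tables; the column names (`recorded_at`, `decision_date`, `late`) are hypothetical.

```python
import pandas as pd

# Hypothetical payment history, including records made after the loan decision.
payments = pd.DataFrame({
    "company_id": [1, 1, 1, 2, 2],
    "recorded_at": pd.to_datetime(
        ["2020-01-10", "2020-03-05", "2020-06-20", "2020-02-01", "2020-07-15"]
    ),
    "late": [0, 1, 1, 0, 1],
})
loans = pd.DataFrame({
    "company_id": [1, 2],
    "decision_date": pd.to_datetime(["2020-04-01", "2020-03-01"]),
})

# Keep only the payments that were already recorded at decision time.
joined = loans.merge(payments, on="company_id", how="left")
known = joined[joined["recorded_at"] < joined["decision_date"]]

late_before = (
    known.groupby("company_id")["late"].sum()
    .rename("late_payments_before_decision")
    .reset_index()
)
features = loans.merge(late_before, on="company_id", how="left").fillna(
    {"late_payments_before_decision": 0}
)
print(features)
```

Aggregating over all of `payments` instead would count late payments that happened after the decision (two for company 1 and one for company 2), which is exactly the posterior information the model could not have had.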
Similar problems are also common in Kaggle competitions based on the same kind of data (banking or insurance, for instance), though, since much care is put into the preparation of the data for the competition, they appear in more subtle ways and forms.
In general, it is easy to spot these leaking features since they strongly correlate with the target, and a domain expert can figure out why (for instance, knowing at what stage the data is recorded in the databases).
Therefore, in competitions, you never find such obvious features, but derivatives of them, often transformed or processed features that have slipped away from the control of the sponsor.
Since features are anonymized, they end up lurking among the other features. This has given rise to a series of hunts for the golden/magic features, a search to combine existing features in the dataset in order to make the leakage emerge. Read more, Another good example.

Training example leakage

This happens especially with non-i.i.d. data, that is, when some cases correlate with each other because they are from the same period (or from contiguous ones) or the same group.
If such cases are not kept together in either the training or the test data, but are split between them, there is a high chance that the machine learning algorithm will learn how to spot the cases (and derive the predictions) rather than using general rules.
- An often-cited example of such a situation.
- A few real cases of leakage:
  - Case 1: the problem arose because of an imperfect train/test split methodology of the competition.
  - Case 2: a series of problems and non-i.i.d. cases affected the correct train/test split of the competition.
  - Case 3: metadata (the creation time of each folder) did the trick.
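In your own validation, a common guard against this kind of leakage is to keep correlated cases together in the same fold. The sketch below uses Scikit-learn's `GroupKFold` on synthetic data; the `groups` array is a hypothetical group identifier (for example, a patient, a boat, or a time period).

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)

# Synthetic data: 20 groups of 10 correlated cases each.
n_groups, per_group = 20, 10
groups = np.repeat(np.arange(n_groups), per_group)
X = rng.normal(size=(len(groups), 4))
y = rng.integers(0, 2, size=len(groups))

cv = GroupKFold(n_splits=5)
overlaps = []
for fold, (train_idx, valid_idx) in enumerate(cv.split(X, y, groups=groups)):
    # No group should appear on both sides of the split.
    shared = set(groups[train_idx]) & set(groups[valid_idx])
    overlaps.append(len(shared))
    print(f"fold {fold}: {len(valid_idx)} validation rows, "
          f"{len(shared)} shared groups")
```

This protects your local validation only; if the competition's own train/test split separates cases from the same group, the leakage is still present in the data.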