Intro

Train & test datasets in Kaggle

  1. In order for a competition to work properly, training data and test data should be from the same distribution.
    • Moreover, the private and public parts of the test data should resemble each other in terms of distribution.
  2. Even if the training and test data are apparently from the same distribution, the lack of sufficient examples in either set could make it difficult to obtain aligned results between the training data and the public and private test data.
  3. The public test data should be regarded as a holdout test in a data science project: to be used only for final validation. Hence, it should not be queried much in order to avoid what is called adaptive overfitting, which implies a model that works well on a specific test set but underperforms on others.

The dangers of overfitting

Shake-ups

Suggested strategy

Here, we suggest a strategy that is a bit more sophisticated than simply following what happens on the public leaderboard:

The importance of validation in competitions

If you think about a competition carefully, you can imagine it as a huge system of experiments. Whoever can create the most systematic and efficient way to run these experiments wins.

Systematic experimentation

Bias and variance

Trying different splitting strategies

To summarize the strategies for validating your model and measuring its performance correctly, you have a couple of choices:

The basic train-test split

Notes on using train_test_split:

  1. When you have large amounts of data, you can expect that the test data you extract is similar to (representative of) the original distribution on the entire dataset.
    • However, since the extraction process is based on randomness, you always have the chance of extracting a non-representative sample.
    • In particular, the chance increases if the training sample you start from is small.
    • Comparing the extracted holdout partition using adversarial validation can help you to make sure you are evaluating your efforts in a correct way.
  2. In addition, to ensure that your test sampling is representative, especially with regard to how the training data relates to the target variable, you can use stratification, which ensures that the proportions of certain features are respected in the sampled data.
    • You can use the stratify parameter in the train_test_split function and provide an array containing the class distribution to preserve.
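The notes above can be sketched as follows. This is a minimal illustration on synthetic data (the arrays `X` and `y` and the 10% positive rate are assumptions made for the example, not part of any competition): passing the labels to `stratify` keeps the class proportions the same in the training and holdout partitions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (illustrative data only): 10% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.array([1] * 100 + [0] * 900)

# stratify=y asks the splitter to preserve the class proportions of y
# in both the training and the holdout partition.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print(f"positive rate, full data: {y.mean():.3f}")
print(f"positive rate, train:     {y_train.mean():.3f}")
print(f"positive rate, holdout:   {y_holdout.mean():.3f}")
```

Without `stratify`, the holdout positive rate would fluctuate around 10% from one random split to the next, and the fluctuation grows as the dataset gets smaller.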

Note: Even if you have a representative holdout available, sometimes a simple train-test split is not enough for ensuring a correct tracking of your efforts in a competition.

Probabilistic evaluation methods

k-fold cross validation

The correct k number of partitions

Two important considerations

  1. The choice of k should reflect your goals:

    • If your purpose is performance estimation, you need models with low bias estimates (i.e. no systematic distortion of estimates). You can achieve this by using a higher number of folds, usually between 10 and 20.
    • If your aim is parameter tuning, you need a mix of bias and variance, so a medium k (between 5 and 7) would be a good choice.
    • If your purpose is just to apply variable selection and simplify your dataset, you need models with low variance estimates. Hence, a lower number of folds will suffice (between 3 and 5).
    • Note: When the size of the available data is quite large, you can safely stay on the lower side of the suggested bands.
  2. If you are just aiming for performance estimation, consider that the more folds you use, the fewer cases you will have in your validation set, so the more the estimates of each fold will be correlated.

    • Beyond a certain point, increasing k renders your cross-validation estimates less predictive of unseen test sets and more representative of an estimate of how well-performing your model is on your training set.
    • This also means that, with more folds, you can get the perfect out-of-fold prediction for stacking purposes.

k-fold CV to produce your predictions: Many Kaggle competitors use the models built during cross-validation to provide a series of predictions on the test set that, averaged, will provide them with the solution.
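A minimal sketch of this idea, assuming synthetic regression data and a `Ridge` model chosen only for illustration: each fold's model predicts the test set, and the per-fold predictions are averaged. The same loop also collects out-of-fold predictions for the training set.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = X_train @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=200)
X_test = rng.normal(size=(50, 4))

kf = KFold(n_splits=5, shuffle=True, random_state=0)
oof_preds = np.zeros(len(X_train))   # one out-of-fold prediction per training row
test_preds = np.zeros(len(X_test))   # running average of the fold models' test predictions

for train_idx, valid_idx in kf.split(X_train):
    model = Ridge(alpha=1.0)
    model.fit(X_train[train_idx], y_train[train_idx])
    # Predict only the rows this fold's model has not seen.
    oof_preds[valid_idx] = model.predict(X_train[valid_idx])
    # Each fold contributes 1/k of the final test prediction.
    test_preds += model.predict(X_test) / kf.get_n_splits()

cv_rmse = np.sqrt(np.mean((oof_preds - y_train) ** 2))
print(f"out-of-fold RMSE: {cv_rmse:.4f}")
```

Averaging the fold models is one common choice; retraining a single model on all the training data is the usual alternative.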

k-fold variations

Since it is based on random sampling, k-fold can provide unsuitable splits when:

Stratified k-fold

k-fold stratification based on multiple variables

k-fold stratification for regression

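import numpy as np
import pandas as pd

# Discretize the continuous target into 10 equal-width bins
# and use the bin index as a proxy target for stratification.
y_proxy = pd.cut(y_train, bins=10, labels=False)

# Alternatively, derive the number of bins from the number
# of training examples using Sturges' rule.
bins = int(np.floor(1 + np.log2(len(X_train))))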
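As a hedged sketch of how the binned target might be used, the example below feeds the proxy labels to `StratifiedKFold` while the model would still be trained on the original continuous target. The data is synthetic and the uniformly distributed target is an assumption made so that every bin holds enough examples for five folds.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_train = pd.DataFrame(rng.normal(size=(500, 3)), columns=["a", "b", "c"])
y_train = pd.Series(rng.uniform(0, 100, size=500))

# Number of bins from Sturges' rule, then equal-width binning of the target.
bins = int(np.floor(1 + np.log2(len(X_train))))
y_proxy = pd.cut(y_train, bins=bins, labels=False)

# Stratify on the proxy; the folds index into the original X_train / y_train.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_means = []
for train_idx, valid_idx in skf.split(X_train, y_proxy):
    fold_means.append(y_train.iloc[valid_idx].mean())

print(f"bins: {bins}")
print("validation-fold target means:", np.round(fold_means, 2))
```

If some bins end up with fewer examples than folds (typical for skewed targets), scikit-learn emits a warning; using fewer bins or quantile-based binning with `pd.qcut` are possible workarounds.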

Non-i.i.d data

Growing training set and a moving validation set

Training (fixed lookback) and validation splits are moving over time
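Both time-based schemes can be sketched with scikit-learn's `TimeSeriesSplit`; this is one possible implementation, not necessarily the one the original figures were based on. Leaving `max_train_size` unset gives a growing training set, while setting it gives a fixed lookback window. The 12-row toy series and the window sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations

# Growing training set: every split trains on all data before the validation window.
expanding = TimeSeriesSplit(n_splits=3, test_size=2)
# Fixed lookback: the training window is capped at the 4 most recent observations.
sliding = TimeSeriesSplit(n_splits=3, test_size=2, max_train_size=4)

expanding_splits = [(tr.tolist(), va.tolist()) for tr, va in expanding.split(X)]
sliding_splits = [(tr.tolist(), va.tolist()) for tr, va in sliding.split(X)]

for name, splits in [("expanding", expanding_splits), ("sliding", sliding_splits)]:
    for train_idx, valid_idx in splits:
        print(f"{name:9} train={train_idx} valid={valid_idx}")
```

In both cases the validation window always lies after the training window, so no future information leaks into training.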

Nested cross-validation

Producing out-of-fold predictions (OOF)

Subsampling

The bootstrap

$$err_{0.632+} = (1 - w) \cdot err_{\text{fit}} + w \cdot err_{\text{bootstrap}}$$

where

$$w = \frac{0.632}{1 - 0.368R}$$

$$R = \frac{err_{\text{bootstrap}} - err_{\text{fit}}}{\gamma - err_{\text{fit}}}$$

Here, $\gamma$ is the no-information error rate, estimated by evaluating the model on all possible combinations of targets and predictors.
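To make the arithmetic concrete, here is a small sketch of the .632+ combination, using the weight as defined by Efron and Tibshirani, w = 0.632 / (1 - 0.368R). The function name `err_632_plus` and the input values are hypothetical, and clipping R to [0, 1] is an assumption borrowed from the original formulation.

```python
def err_632_plus(err_fit: float, err_bootstrap: float, gamma: float) -> float:
    """Combine the in-sample error and the out-of-bag error with the .632+ weight."""
    denom = gamma - err_fit
    # Relative overfitting rate R; clipped to [0, 1] (assumption, see lead-in).
    r = (err_bootstrap - err_fit) / denom if denom > 0 else 0.0
    r = min(max(r, 0.0), 1.0)
    w = 0.632 / (1 - 0.368 * r)
    return (1 - w) * err_fit + w * err_bootstrap


# Hypothetical error rates, chosen only to exercise the formula.
moderate = err_632_plus(err_fit=0.10, err_bootstrap=0.30, gamma=0.50)
no_overfit = err_632_plus(err_fit=0.10, err_bootstrap=0.10, gamma=0.50)
full_overfit = err_632_plus(err_fit=0.10, err_bootstrap=0.50, gamma=0.50)

print(f"moderate overfitting: {moderate:.4f}")
print(f"no overfitting (R=0): {no_overfit:.4f}")
print(f"full overfitting (R=1): {full_overfit:.4f}")
```

With R = 0 the weight reduces to 0.632, and with R = 1 it becomes 1, so the estimate relies entirely on the out-of-bag error when the model overfits as much as possible.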

Given the limits and intractability of using the bootstrap as in classical statistics for machine learning applications, you can instead use the second method, getting your evaluation from the examples left not sampled by the bootstrap.

import random

def Bootstrap(n, n_iter=3, random_state=None):
    """
    Random sampling with replacement cross-validation generator.
    For each iteration, a bootstrap sample of the indexes [0, n) is
    generated, and the function returns the obtained sample
    and a list of all the excluded indexes.
    """
    # Seed only when a seed is given (0 is a valid seed).
    if random_state is not None:
        random.seed(random_state)
    for j in range(n_iter):
        # Draw n indexes with replacement.
        bs = [random.randint(0, n-1) for i in range(n)]
        # Indexes that were never drawn form the out-of-bag set.
        out_bs = list({i for i in range(n)} - set(bs))
        yield bs, out_bs
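A usage sketch of out-of-bag evaluation follows. It re-declares a small generator (named `bootstrap_indices` here, a hypothetical name) so that the example is self-contained, and it uses a deliberately trivial "model" that predicts the in-bag mean; the toy data and the choice of mean absolute error are assumptions for illustration.

```python
import random
import statistics


def bootstrap_indices(n, n_iter=3, random_state=None):
    """Yield (in-bag, out-of-bag) index lists for n_iter bootstrap resamples."""
    rng = random.Random(random_state)
    for _ in range(n_iter):
        in_bag = [rng.randrange(n) for _ in range(n)]
        out_of_bag = sorted(set(range(n)) - set(in_bag))
        yield in_bag, out_of_bag


# Toy target values (illustrative only).
data_rng = random.Random(0)
y = [data_rng.gauss(10.0, 2.0) for _ in range(1000)]

oob_errors, oob_fractions = [], []
for in_bag, out_of_bag in bootstrap_indices(len(y), n_iter=20, random_state=42):
    # "Model": predict the in-bag mean for every example.
    prediction = statistics.fmean(y[i] for i in in_bag)
    # Evaluate only on the examples the resample left out.
    mae = statistics.fmean(abs(y[i] - prediction) for i in out_of_bag)
    oob_errors.append(mae)
    oob_fractions.append(len(out_of_bag) / len(y))

mean_mae = statistics.fmean(oob_errors)
mean_fraction = statistics.fmean(oob_fractions)
print(f"mean out-of-bag MAE: {mean_mae:.3f}")
print(f"std of MAE across resamples: {statistics.stdev(oob_errors):.3f}")
print(f"mean out-of-bag fraction: {mean_fraction:.3f}")
```

On average about 36.8% of the examples are left out of each resample, which is where the 0.632 constant in the bootstrap estimators comes from.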

***In conclusion, the bootstrap is indeed an alternative to cross-validation. It is certainly more widely used in statistics and finance. In machine learning, the golden rule is to use the k-fold cross-validation approach. However, we suggest not forgetting about the bootstrap in all those situations where, due to outliers or a few examples that are too heterogeneous, you have a large standard error of the evaluation metric in cross-validation. In these cases, the bootstrap will prove much more useful in validating your models.***

Tuning your model validation system

Using adversarial validation

Example implementation

This implementation is based on the Tabular Playground Series (January 2021) competition.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score
train = pd.read_csv("../input/tabular-playground-series-jan-2021/train.csv")
test = pd.read_csv("../input/tabular-playground-series-jan-2021/test.csv")

# Data preparation is short and to the point. Since all features are numeric, 
# you won’t need any label encoding, but you do have to fill any missing values
# with a negative number (-1 usually works fine), and drop the target and also any identifiers.

train = train.fillna(-1).drop(["id", "target"], axis=1)
test = test.fillna(-1).drop(["id"], axis=1)
# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent.
X = pd.concat([train, test])
y = [0] * len(train) + [1] * len(test)

# At this point, you just need to generate `RandomForestClassifier` predictions 
# for your data using the `cross_val_predict` function, which automatically
# creates a cross-validation scheme and stores the predictions on the validation fold:

model = RandomForestClassifier()
cv_preds = cross_val_predict(model, X, y, cv=5, n_jobs=-1, method='predict_proba')

# As a result, you obtain predictions that are unbiased (they are not overfit as
# you did not predict on what you trained) and that can be used for error
# estimation. Please note that `cross_val_predict` won’t fit your instantiated
# model, so you won’t get any information from it, such as what the important
# features used by the model are. If you need such information, you just need to
# fit it first by calling `model.fit(X, y)`.

print(roc_auc_score(y_true=y, y_score=cv_preds[:,1]))

# You should obtain a value of around 0.49-0.50 (the result won’t be exactly
# reproducible unless you fix `random_state` on the model and on any
# cross-validation splitter that shuffles the data).

Handling different distributions of training and test data

Suppression

Train on the examples most similar to test set

# Count the training examples that the adversarial model mistakes for test data.
print(np.sum(cv_preds[:len(train), 1] > 0.5))
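One way to act on these scores, sketched under stated assumptions: rank the training rows by their predicted probability of being test data and keep the most test-like ones for training or validation. The data below is synthetic (the test set is shifted on one feature), and `n_keep` is a hypothetical cutoff that would need tuning in practice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

# Synthetic data (illustrative only): the test set is shifted on the first feature.
rng = np.random.default_rng(0)
train_X = rng.normal(size=(600, 3))
test_X = rng.normal(size=(300, 3))
test_X[:, 0] += 1.5

# Adversarial target: 0 for training rows, 1 for test rows.
X_all = np.vstack([train_X, test_X])
is_test = np.array([0] * len(train_X) + [1] * len(test_X))

model = RandomForestClassifier(n_estimators=100, random_state=0)
adv_preds = cross_val_predict(
    model, X_all, is_test, cv=5, method="predict_proba"
)[:, 1]

# Keep only the scores of the training rows and rank them, most test-like first.
train_scores = adv_preds[: len(train_X)]
n_keep = 200  # hypothetical cutoff
most_test_like = np.argsort(train_scores)[::-1][:n_keep]
X_selected = train_X[most_test_like]

print(f"mean of feature 0, all training rows: {train_X[:, 0].mean():.2f}")
print(f"mean of feature 0, selected rows:     {X_selected[:, 0].mean():.2f}")
```

The selected rows drift toward the test distribution on the shifted feature. The trade-off is a smaller training set, so this approach only pays off when enough test-like examples remain.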

Validating by mimicking the test set

model.fit(X, y)
ranks = sorted(list(zip(X.columns, model.feature_importances_)), 
               key=lambda x: x[1], reverse=True)
for feature, score in ranks:
    print(f"{feature:10} : {score:0.4f}")

Concluding remarks on adversarial validation

Handling leakage

Feature leakage

Posterior information

Training example leakage