modeling_for_tabular_competitions.md

CTGAN

There’s a GAN model called CTGAN which can be used for generating synthetic data.
- CTGAN Github
- CTGAN works by modeling the probability distribution of rows in tabular data and then generating realistic synthetic data, paper
- The Synthetic Data Vault by MIT has CTGAN and a few other tools around it.
- Note: Think about the technology that Kaggle used to generate the data. If you can properly understand how the data has been generated, you get an important advantage. It gives you a way to easily obtain more varied data for training. Example
  - Note: Keep in mind that understanding data distribution is no easy task, check out this notebook for more explanation.

Reproducibility

You’d want to maintain reproducibility and save all:
- models (from every fold)
- the list of parameters used
- all the fold predictions
- all the out-of-fold predictions
- all the predictions from all the models
You could use a simple .txt file or an Excel file to keep track of things. But, there are some tools out there that you could use:
- DVC
- Weights and Biases
- MLflow
- Neptune
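
If you go the simple file-based route, a minimal sketch could look like the following. The function name `save_run`, the folder layout, and the file names are illustrative assumptions, not a convention from any of the tools listed above.

```python
import json
from pathlib import Path

import numpy as np


def save_run(run_dir, params, oof_preds, test_preds):
    """Store the parameters and predictions of one experiment in its own folder."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    # Parameters go to a human-readable JSON file
    (run_dir / "params.json").write_text(json.dumps(params, indent=2, sort_keys=True))
    # Predictions are kept as NumPy arrays so they can be reloaded for blending
    np.save(run_dir / "oof_preds.npy", np.asarray(oof_preds))
    np.save(run_dir / "test_preds.npy", np.asarray(test_preds))
    return run_dir
```

Calling `save_run("runs/run_001", params, oof, test)` once per model (the path is a placeholder) keeps every experiment reloadable later with `np.load`.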

Setting random state for reproducibility

We have to set the random seed so that we get the same results every time we run the code.
- The same random seed corresponds to the same sequence of random numbers.
For sklearn models, we could use the built-in random_state.
For TensorFlow or PyTorch, we could use the following function:

import os
import random

import numpy as np


def seed_everything(seed,
                    tensorflow_init=True,
                    pytorch_init=True):
    """
    Seeds basic parameters for reproducibility of results
    """
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    if tensorflow_init:
        # Imported here so that TensorFlow is only needed when requested
        import tensorflow as tf
        tf.random.set_seed(seed)
    if pytorch_init:
        # Imported here so that PyTorch is only needed when requested
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

EDA

In an EDA, you’ll look for:

Missing values and, most importantly, missing value patterns correlated with the target.
Skewed numeric variables and their possible transformations.
Rare categories in categorical variables that can be grouped together.
Potential outliers, both univariate and multivariate.
Highly correlated (or even duplicated) features. For categorical variables, focus on categories that overlap.
The most predictive features for the problem.
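
Several of these checks can be scripted in a few lines of pandas. The sketch below is only a starting point under stated assumptions: the function name `quick_eda`, the 1% rarity threshold, and the 0.95 correlation cut-off are arbitrary choices for illustration, and the target is assumed to be numeric.

```python
import pandas as pd


def quick_eda(df, target, rare_threshold=0.01, corr_threshold=0.95):
    """Return a few of the checks listed above as plain Python/pandas objects."""
    features = df.drop(columns=[target])
    numeric = features.select_dtypes(include="number")
    categorical = features.select_dtypes(exclude="number")

    # Correlation between "value is missing" and the target, per feature
    missing = features.isna()
    missing = missing.loc[:, missing.any()]
    missing_corr = missing.astype(float).corrwith(df[target])

    # Skewness of numeric features (large absolute values suggest a transformation)
    skewness = numeric.skew()

    # Categories whose relative frequency is below the threshold
    rare = {}
    for col in categorical.columns:
        freq = categorical[col].value_counts(normalize=True)
        rare[col] = freq[freq < rare_threshold].index.tolist()

    # Pairs of numeric features with very high absolute correlation
    corr = numeric.corr().abs()
    cols = list(corr.columns)
    high_corr = [(cols[i], cols[j], float(corr.iloc[i, j]))
                 for i in range(len(cols)) for j in range(i + 1, len(cols))
                 if corr.iloc[i, j] > corr_threshold]

    return {"missing_corr": missing_corr, "skewness": skewness,
            "rare_categories": rare, "high_corr_pairs": high_corr}
```

The output only shows where to look. Deciding what a missing-value pattern or a rare category means for the problem is still manual work.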

To simplify things and save some coding time, you could also use EDA tools:

AutoViz: AutoViz
- Understanding what AutoViz can do
- AutoViz notebook
Sweetviz: Sweetviz
- Sweetviz overview
Pandas Profiling: Pandas Profiling
- Intro to Pandas Profiling
Note: Obviously, you could also use other Kagglers' EDA notebooks.

Always make sure to do your own EDA: Remember that EDA stops being a commodity and becomes an asset for the competition when it is highly specific to the problem at hand; this is something that you will never find from automated solutions and seldom in public Notebooks. You have to do your EDA by yourself and gather key, winning insights.

Dimensionality reduction

Make sure to always consider using these dimensionality reduction techniques. They can be pretty helpful in identifying outliers and the presence of relevant clusters in the data.

t-SNE
UMAP
Note: Plot the scatter graph of 2-D projection and color it by target value.
- Example: A good example of using t-SNE in an image competition here.
Note: You can use them as features in your modeling effort.
- Example
Note: t-SNE and UMAP are more revealing than the classical methods based on variance restructuring by linear combination such as PCA or SVD.
- Compared to these approaches, UMAP and t-SNE manage to reduce the dimensionality drastically, allowing visual charting of the results while maintaining the topography of the data.
- The downside is they’re much slower to fit.
  - Note: Nvidia has released the RAPIDS suite based on CUDA, which returns the result in a reasonable timeframe. A notebook example of using RAPIDS. Another example.

Implementation tips:
- How to use t-SNE effectively? Read the article here
- Understanding UMAP: Read the article here

Reducing data size, avoiding out-of-memory errors

Unlike deep learning, where data is fed in batches, most of the algorithms that work with tabular data require handling all the data in memory.
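The most common situation is when you read the data using Pandas read_csv but the resulting DataFrame is too large to fit in memory.
The solution is to shrink the DataFrame by downcasting each numeric column to the smallest type that can hold its values. For integer columns this loses no information; for float columns, casting to float32 keeps the range but reduces precision.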
This can be achieved using the following script:

import numpy as np

def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64',
                'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Use the smallest integer type whose range holds the column
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min >= np.iinfo(np.int64).min and c_max <= np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Use float32 when the value range allows it
                if c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'
              .format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

Note: Our suggestion is to apply it after feature engineering or before major transformations that do not rescale your existing data.

Note: Combining this with the garbage collection library gc and calling gc.collect() can improve the memory situation further.

Note: Another way to reduce data size is feature engineering and feature selection.
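
As an alternative to a hand-written loop, pandas can pick the smaller type itself through the `downcast` argument of `pd.to_numeric`. The sketch below assumes plain NumPy-backed numeric columns; the helper name `downcast_numeric` is made up for illustration.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy of df with numeric columns stored in smaller dtypes."""
    out = df.copy()
    # Integer columns: smallest signed integer type that holds all values
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    # Float columns: float32 when pandas judges the values to survive the cast
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out
```

Comparing `df.memory_usage().sum()` before and after shows how much was saved.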

Feature engineering

Mostly, the real differentiator is not just the quality of the data (the lack of missing values and the reliability of the values) or its quantity (the number of examples).
- The real differentiator is mostly the informational value of the content itself, which is represented by the type of features.
- Models only make apparent the value in data. They are not magic in themselves.

Easily derived features

Here are the most common transformations to try:

Time feature processing:
- Extracting day of the week, time of the day, month number, etc. from datetime variables.
- Cyclic continuous transformations (based on sine and cosine transformations) are also useful for representing the continuity of time and creating periodic features:

cycle = 7
df['weekday_sin'] = np.sin(2 * np.pi * df['col1'].dt.dayofweek / cycle)
df['weekday_cos'] = np.cos(2 * np.pi * df['col1'].dt.dayofweek / cycle)

Numeric feature transformations:
- Scaling: obtained by standardization (the z-score method)
- Normalization: also called min-max scaling
- Logarithmic or exponential transformations
- Separating the integer and decimal parts
- Summing, subtracting, multiplying, or dividing two numeric features
Binning of numeric features:
- This is used to transform continuous variables into discrete ones by distributing their values into a number of bins.
- Binning helps remove noise and errors in data and it allows easy modeling of non-linear relationships between the binned features and the target variable when paired with one-hot encoding. See the Scikit-learn implementation.
Categorical feature encoding:
- One-hot encoding
Splitting and aggregating categorical features based on levels:
- See this example.
Polynomial features:
- In Scikit-learn
Missing values treatment:
- Make binary features that point out missing values, because sometimes missingness is not random and a missing value could have some important reason behind it.
- Usually, missingness points out something about the way data is recorded, acting like a proxy variable for something else.
- If required by your learning algorithm, replace the missing values with the mean, median, or mode (it is seldom necessary to use methods that are more sophisticated).
- ***A guide to handling missing values in Python:*** Link here
- Note: Keep in mind that some models can handle missing values by themselves, often better than many standard approaches, because missing-value handling is part of their optimization procedure. The models that can do this are all gradient boosting models:
  - XGBoost, read more
  - LightGBM, read more
  - CatBoost, read more
Outlier capping or removal:
- Exclude, cap to a maximum or minimum value, or modify outlier values in your data.
- Outlier detection in Scikit-learn.
- Note: Otherwise, you can simply locate the outlying samples in a univariate fashion, basing your judgment on how many standard deviations they are from the mean, or their distance from the boundaries of the interquartile range (IQR).
  - In this case, you might simply exclude any points that are above the value of 1.5 * IQR + Q3 (upper outliers) and any points that are below Q1 - 1.5 * IQR (lower outliers).
  - Once you have found the outliers, you can also proceed by pointing them out with a binary variable.
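
The missing-value indicator and the IQR rule described above can be sketched together as follows. The function name, the `_missing` / `_outlier` column suffixes, and the default multiplier `k=1.5` are illustrative choices.

```python
import pandas as pd


def add_missing_and_outlier_flags(df, columns, k=1.5):
    """Add <col>_missing and <col>_outlier binary columns for each listed column."""
    out = df.copy()
    for col in columns:
        # 1 where the value is missing, 0 otherwise
        out[f"{col}_missing"] = out[col].isna().astype(int)
        # Quartiles are computed on the non-missing values
        q1, q3 = out[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        lower, upper = q1 - k * iqr, q3 + k * iqr
        # 1 where the value falls outside [Q1 - k*IQR, Q3 + k*IQR]
        out[f"{col}_outlier"] = ((out[col] < lower) | (out[col] > upper)).astype(int)
    return out
```

Missing values are never flagged as outliers here, because comparisons with NaN evaluate to False.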

Note: All these data transformations can add predictive performance to your models, but they are seldom decisive in a competition.

Meta-features based on rows and columns

For competitions, you need trickier feature engineering.
Meta features help to distinguish the different kinds of samples found in your data by pointing out specific groups of samples to your algorithm.
A good place to start is looking at features based on each row:
- Compute the mean, median, sum, standard deviation, minimum, or maximum of the numeric values (or of a subset of them)
- Count the missing values
- Compute the frequencies of common values found in the rows (for instance, considering the binary features and counting the positive values)
- Assign each row to a cluster derived from a cluster analysis such as k-means.
Meta features are also made based on columns.
- Aggregation and summarization operations on a single feature
- Is this characteristic common or rare? (for example in counting different categories in a feature).
- You can use any kind of column statistic: mode, median, mean, sum, standard deviation, min, max, skewness, kurtosis.
- There are other different ways:
  - Frequency encoding: Count of frequency for categorical features (and replace categorical feature with its frequency count).
  - Frequencies (and column statistics) computed with respect to a relevant group: The groupby operation.
    - The group could be coming from cluster analysis, or one of the current features.
  - Example notebook.
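
A few of the row-based and column-based meta-features above can be sketched with pandas. The function name and the generated column names are assumptions made for this example.

```python
import pandas as pd


def add_meta_features(df, numeric_cols, cat_col, value_col):
    """Add row statistics, a frequency encoding, and a group-wise statistic."""
    out = df.copy()
    # Row-wise statistics over a subset of numeric columns
    out["row_mean"] = out[numeric_cols].mean(axis=1)
    out["row_std"] = out[numeric_cols].std(axis=1)
    out["row_n_missing"] = out[numeric_cols].isna().sum(axis=1)
    # Frequency encoding: how often each category occurs in the column
    out[f"{cat_col}_freq"] = out[cat_col].map(out[cat_col].value_counts())
    # Column statistic computed within each group of the categorical feature
    out[f"{value_col}_mean_by_{cat_col}"] = (
        out.groupby(cat_col)[value_col].transform("mean")
    )
    return out
```

The grouping column could just as well be a cluster label from k-means instead of an existing feature.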

Target encoding

Encoding categorical features can be done using Scikit-learn:
- LabelEncoder
- OneHotEncoder
- OrdinalEncoder
Note: When the number of categories is too large, one-hot encoding becomes sparse and cumbersome to handle in memory. These are high-cardinality features, which require special handling.
- You could use an encoding function that is computed according to the Micci-Barreca paper: A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems (2001).
  - The idea is to transform the many categories of a categorical feature into their corresponding expected target value.
  - In case of regression, this is the average expected value for that category.
  - For binary classification, it is the conditional probability given that category.
  - For multiclass classification, the conditional probability for each possible outcome.
  - This way, the categorical feature is transformed into a numeric one without having to convert the data into a larger and sparser dataset.
  - This is target encoding and it is indeed very effective in many situations.
  - Note: When some categories are too rare, using target encoding is almost equivalent to providing the target label. There are ways to avoid this. The solution is to blend the observed posterior probability on that level (the probability of the target given a certain value of the encoded feature) with the a priori probability (the probability of the target observed on the entire sample) using a lambda factor. This is called the empirical Bayesian approach.
  - In practical terms, we are using a function to determine if, for a given level of a categorical variable, we are going to use the conditional target value, the average target value, or a blend of the two.
  - This is dictated by the lambda factor, which, for a fixed k parameter (usually it has a unit value, implying a minimum cell frequency of two samples) has different output values depending on the f value that we choose.
  - For a fixed k, higher values of f dictate less trust in the observed empirical frequency and more reliance on the empirical probability for all cells.
  - The right value for f is usually a matter of testing (supported by cross-validation), since you can consider the f parameter a hyperparameter in itself.
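
Written out, the blend described above can be expressed as follows. The notation is chosen here for illustration: $n_i$ is the number of training samples in category $i$, $\bar{y}_i$ is the mean target within that category, and $\bar{y}$ is the mean target over the whole training set (the prior). The sigmoid form of $\lambda$ is the one used by the `TargetEncode` class in this section.

$$
\lambda(n_i) = \frac{1}{1 + e^{-(n_i - k)/f}},
\qquad
\text{encoding}_i = \lambda(n_i)\,\bar{y}_i + \bigl(1 - \lambda(n_i)\bigr)\,\bar{y}
$$

When $n_i = k$, $\lambda = 0.5$ and the category mean and the prior get equal weight. As $n_i$ grows, $\lambda$ approaches 1 and the encoding approaches the category mean. A larger $f$ flattens the sigmoid, so a category needs more samples before its own mean dominates.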

From: PetFinder.my Adoption Prediction

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
class TargetEncode(BaseEstimator, TransformerMixin):
    
    def __init__(self, categories='auto', k=1, f=1, 
                 noise_level=0, random_state=None):
        if type(categories)==str and categories!='auto':
            self.categories = [categories]
        else:
            self.categories = categories
        self.k = k
        self.f = f
        self.noise_level = noise_level
        self.encodings = dict()
        self.prior = None
        self.random_state = random_state
        
    def add_noise(self, series, noise_level):
        return series * (1 + noise_level *   
                         np.random.randn(len(series)))
        
    def fit(self, X, y=None):
        if isinstance(self.categories, str) and self.categories == 'auto':
            # Select the names of all object (string) columns
            self.categories = X.select_dtypes(include='object').columns.tolist()
        temp = X.loc[:, self.categories].copy()
        temp['target'] = y
        self.prior = np.mean(y)
        for variable in self.categories:
            avg = (temp.groupby(by=variable)['target']
                       .agg(['mean', 'count']))
            # Compute smoothing 
            smoothing = (1 / (1 + np.exp(-(avg['count'] - self.k) /                 
                         self.f)))
            # The bigger the count the less full_avg is accounted
            self.encodings[variable] = dict(self.prior * (1 -  
                             smoothing) + avg['mean'] * smoothing)
            
        return self
    
    def transform(self, X):
        Xt = X.copy()
        for variable in self.categories:
            # Map known categories to their encoding; unseen ones get the prior
            Xt[variable] = (X[variable].map(self.encodings[variable])
                                       .fillna(self.prior)
                                       .astype(float))
            if self.noise_level > 0:
                if self.random_state is not None:
                    np.random.seed(self.random_state)
                Xt[variable] = self.add_noise(Xt[variable],
                                              self.noise_level)
        return Xt
    
    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)

# How to use the class
te = TargetEncode(categories='ROLE_TITLE')
te.fit(train, train['ACTION'])
te.transform(train[['ROLE_TITLE']])

The input parameters of the function are:

categories: The column names of the features you want to target-encode. You can leave auto on and the class will pick the object strings.
k (int): Minimum number of samples to take a category average into account.
f (int): Smoothing effect to balance the category average versus the prior probability, or the mean value relative to all the training examples.
noise_level: The amount of noise you want to add to the target encoding in order to avoid overfitting. Start with very small numbers.
random_state: The reproducibility seed in order to replicate the same target encoding when noise_level > 0.

Note: Instead of writing your own code, you could also use this library: category_encoders and its Target Encoder

Using feature importance

Applying too much feature engineering can have side effects.
- Each variable carries some noise. With too many variables, you’re increasing the chance that model picks up on noise instead of signal.
Only keep the relevant features.
Figuring out the features you need to keep is a hard problem. As the number of features grows, the number of possible combinations grows as well.
Do the feature selection at the end (after feature engineering) once you have all the features.

How to select features?

Classical approach $\rightarrow$ forward addition or backward elimination. This is quite time-consuming.
For regression models, using lasso selection can provide a hint about all the important yet correlated features by using stability selection. (read more)
- Note: The procedure may, in fact, retain even highly correlated features.
For tree-based models (random forests, gradient boosting), a decrease in impurity or a gain in the target metric based on splits are common ways to rank features.
Also for tree-based models, test-based randomization of features (or simple comparisons with random features) helps to distinguish useful features from noise. Example LSTM, Example Boruta, Example BorutaShap.
- Note: Boruta or BorutaShap may take up to 100 iterations and it can only be performed using tree-based machine learning algorithms.
- Note: If you are selecting features for a linear model, Boruta may actually overshoot. This is because it will consider the features important both for their main effects and their interactions together with other features (but in a linear model, you care only about the main effects and a selected subset of interactions).
- Note: You can still effectively use Boruta when selecting for a linear model by using a gradient boosting model whose maximum tree depth is set to one, so you are considering only the main effects of the features and not their interactions.
- A BorutaShap feature selection notebook
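
The "simple comparison with random features" idea can be sketched as below. This is not Boruta itself, only a simplified single-pass check: the function name, the `random_probe_*` column names, and the choice of a scikit-learn random forest with impurity-based importances are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def select_above_random_probe(X, y, n_probes=3, random_state=0):
    """Keep features whose importance exceeds that of every random probe column."""
    rng = np.random.default_rng(random_state)
    X_aug = X.copy()
    probe_names = []
    for i in range(n_probes):
        name = f"random_probe_{i}"  # hypothetical column name
        # A probe is pure noise, so it cannot carry real signal
        X_aug[name] = rng.normal(size=len(X))
        probe_names.append(name)
    model = RandomForestClassifier(n_estimators=200, random_state=random_state)
    model.fit(X_aug, y)
    importances = pd.Series(model.feature_importances_, index=X_aug.columns)
    # A real feature must beat the best-scoring noise column to be kept
    threshold = importances[probe_names].max()
    return [col for col in X.columns if importances[col] > threshold]
```

A single run like this is noisy. Repeating it over several seeds, as Boruta does over its iterations, gives a more stable answer.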

Pseudo-labeling

In competitions where the number of examples used for training can make a difference, pseudo-labeling can boost your scores by providing further examples taken from the test set.
The idea is to add examples from the test set whose predictions you are confident about to your training set.
Pseudo-labeling simply helps models to refine their coefficients thanks to more data available.
Pseudo-labeling was first introduced in the Santander Customer Transaction Prediction competition by one of the teams, notebook.
Note: Pseudo-labeling won’t always work.
- You cannot know for sure beforehand whether or not pseudo-labeling will work in a competition. You have to test it empirically.
  - Plotting learning curves may provide you with a hint as to whether having more data could be useful, example.
- It is not easy to decide which parts of the test set predictions to add or how to tune the entire procedure for the best results.
Generally, the procedure is like this:
1. Train your model
2. Predict on the test set
3. Establish a confidence measure
4. Select the test set elements to add
5. Build a new model with the combined data
6. Predict using this model and submit

A good example of the complete procedure.

There are a few caveats when applying pseudo-labeling:
- You should have a very good model that produces good predictions for them to be usable in training. Otherwise, you will just add more noise.
- Since it is impossible to have entirely perfect predictions in the test set, you need to distinguish the good ones from the ones you shouldn’t use. If you are predicting using CV folds, check the standard deviation of your predictions and pick only the test examples where the standard deviation is the lowest.
  - If you are predicting probabilities, use only high-end or low-end predicted probabilities (the cases where the model is actually more confident).
- In the second stage, when you concatenate the training examples with the test ones, do not put in more than 50% test examples.
  - Ideally, a share of 70% original training examples and 30% pseudo-labeled examples is the best.
- If you depend on validation for early stopping, fixing hyperparameters, or simply evaluating your model, do not use pseudo-labels in the validation.
- If possible, use a different kind of model when training to estimate the pseudo-labels and when training your final model using both the original labels and the pseudo-labels. This will ensure you are not simply enforcing the same information your previous model used, but you are also extracting new information from the pseudo-labels.
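
The selection step (steps 3 and 4 of the procedure, together with the caveats above) can be sketched for a binary classification problem as follows. The function name and all thresholds (`low`, `high`, `max_std`, `max_share`) are illustrative defaults, not recommended values.

```python
import numpy as np


def select_pseudo_labels(fold_probs, n_train, low=0.05, high=0.95,
                         max_std=0.05, max_share=0.3):
    """Pick confident test rows from fold-wise predicted probabilities.

    fold_probs: array of shape (n_folds, n_test), one row per CV-fold model.
    Returns (indices of the selected test rows, their hard 0/1 pseudo-labels).
    """
    fold_probs = np.asarray(fold_probs, dtype=float)
    mean = fold_probs.mean(axis=0)
    std = fold_probs.std(axis=0)
    # Confident = extreme average probability and agreement across folds
    confident = ((mean <= low) | (mean >= high)) & (std <= max_std)
    idx = np.flatnonzero(confident)
    # Cap the pseudo-labeled rows at max_share of the combined dataset
    max_rows = int(max_share / (1 - max_share) * n_train)
    if len(idx) > max_rows:
        # Keep the rows whose mean probability is farthest from 0.5
        order = np.argsort(-np.abs(mean[idx] - 0.5), kind="stable")
        idx = np.sort(idx[order[:max_rows]])
    return idx, (mean[idx] >= 0.5).astype(int)
```

The returned rows are then appended to the training folds only; validation folds stay free of pseudo-labels.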

Denoising with autoencoders

Autoencoders were initially known for non-linear data compression and image denoising.
This post explains how a DAE can not only remove noise but also automatically create new features, so the representation of the features is learned in a similar way to what happens in image competitions.
- Note: In the post, he mentions the secret sauce for the DAE recipe, which is not simply the layers, but the noise you put into the data in order to augment it.
- Note: He also made clear that the technique requires stacking together training and test data, implying that the technique would not have applications beyond winning a Kaggle competition.
There are two types of DAEs:
- In bottleneck DAEs, mimicking the approach used in image processing, you take as new features the activations from the middle layer, the one separating the encoding part from the decoding part. These architectures have an hourglass shape, first reducing the number of neurons layer by layer until the middle bottleneck layer, then enlarging it back in the second part. The number of hidden layers is always odd.
- In deep stack DAEs, you take all the activations from the hidden layers, without distinguishing between the encoding, decoding, or middle layer. In these architectures, layers are the same size. The number of hidden layers can be even or odd.
Random noise: In order to help train any kind of DAE, you need to inject noise that helps to augment the training data and avoid the overparameterized neural network just memorizing inputs (in other words, overfitting).
- In the Porto Seguro competition, Michael Jahrer added noise by using a technique called swap noise, which he described as follows:
- Here I sample from the feature itself with a certain probability “inputSwapNoise” in the table above. 0.15 means 15% of features replaced by values from another row.
- What is described is basically an augmentation technique called mixup (which is also used in image augmentation). Read more
- In mixup for tabular data, you decide a probability for mixing up. Based on that probability, you change some of the original values in a sample, replacing them with values from a more or less similar sample from the same training data. A walkthrough example of mixup
  - In column-wise noise swapping, you swap values in a certain number of columns. The proportion of columns whose values are to be swapped is decided based on your mixup probability.
  - In row-wise noise swapping, you always swap a certain number of the values in each row. Essentially, every row contains the same proportion of swapped values, based on the mixup probability, but the features swapped change from row to row.
  - In random noise swapping, you fix a number of values to be swapped, based on the mixup probability, and you randomly pick them up from the entire dataset (this is somewhat similar to row-wise swapping in effect).
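
A per-cell version of swap noise can be sketched with NumPy. It follows the quoted description (each value is replaced, with a given probability, by a value from the same column in another row), but the function name and the implementation details are assumptions, not Michael Jahrer's actual code.

```python
import numpy as np


def swap_noise(X, swap_prob=0.15, random_state=None):
    """Replace each cell, with probability swap_prob, by the value found in the
    same column of a randomly chosen row."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(X)
    n_rows, n_cols = X.shape
    # Boolean mask of the cells to be replaced
    mask = rng.random((n_rows, n_cols)) < swap_prob
    # For every cell, pick a random donor row; the column stays the same
    donor_rows = rng.integers(0, n_rows, size=(n_rows, n_cols))
    donor_values = X[donor_rows, np.arange(n_cols)]
    return np.where(mask, donor_values, X)
```

Because donors come from the same column, each feature keeps its marginal distribution. The donor row can happen to be the row itself, in which case the cell is unchanged.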

Important factors to keep an eye on when working with DAEs:

Architecture of the DAE (deep stack tends to work better, but you need to determine the number of units per layer and the number of layers)
Learning rate and batch size
Loss (also distinguishing between the loss of numeric and categorical features helps)
Stopping point (the lowest loss is not always the best; use validation and early stopping if possible)

Examples of recent DAE implementations

Note: If you don’t want to spend too much time building your own DAE, but you would like to explore whether something like it could work for the competition you are taking on, you can test out a couple of pre-prepared solutions. First, you can refer to this notebook, and re-adapt it to your needs.

Or, you can use this library from one of the Kagglers.

Neural networks for tabular competitions

Gradient boosting solutions still clearly dominate tabular competitions (as well as real-world projects).
- However, sometimes neural networks can catch signals that gradient boosting models cannot get, and can be excellent single models or models that shine in an ensemble.
Note: As many Grandmasters, past and present, often say, mixing together diverse models (such as a neural network and a gradient boosting model) always produces better results than single models taken separately in a tabular data problem. Check out this video from former number one on Kaggle regarding this.
Example 1: deep learning for tabular data
Example 2: TensorFlow for tabular data

The key things to take into account when building these solutions are:

Use activations such as GeLU, SeLU, or Mish instead of ReLU; they are quoted in quite a few papers as being more suitable for modeling tabular data and our own experience confirms that they tend to perform better.
Experiment with batch size.
Use augmentation with mixup (discussed in the section on autoencoders).
Use quantile transformation on numeric features and force, as a result, uniform or Gaussian distributions.
Leverage embedding layers, but also remember that embeddings do not model everything. In fact, they miss interactions between the embedded feature and all the others (so you have to force these interactions into the network with direct feature engineering).
- In particular, remember that embedding layers are reusable.
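
The quantile transformation to a Gaussian shape mentioned above can be sketched in a few lines. This is a simplified rank-based version for a single feature, written only to show the idea; in practice scikit-learn's `QuantileTransformer(output_distribution="normal")` is the usual choice. The function name `rank_gauss` is made up for the example.

```python
from statistics import NormalDist

import numpy as np


def rank_gauss(values):
    """Map a 1-D numeric array to an approximately standard normal distribution."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    # Rank of each value (0 .. n-1); ties are broken by order of appearance
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    # Turn ranks into quantiles strictly inside (0, 1)
    quantiles = (ranks + 0.5) / n
    # Push the quantiles through the inverse normal CDF
    inv_cdf = NormalDist().inv_cdf
    return np.array([inv_cdf(q) for q in quantiles])
```

The transform depends only on the ordering of the values, so heavy tails and outliers are squeezed into a bounded, bell-shaped range before they reach the network.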

Out-of-the-box solutions
If you don’t want to build your own deep neural network in TensorFlow or PyTorch, you can rely on a few out-of-the-box architectural solutions. Here are the main ones you can try when taking on a tabular competition yourself:

TabNet: Developed by Google researchers (2020).
- Paper
- Implementation 1 using pytorch-tabnet package
- Implementation 2
Neural Oblivious Decision Ensembles (NODE):
Factorization machines: You can use a wide range of models, such as Wide & Deep, DeepFM, xDeepFM, AutoInt, and many others based on factorization machines and mostly devised for click-through rate estimation.
- DeepCTR
- DeepTables
Note: In conclusion, you can build your own neural network for tabular data by mixing together embedding layers for categorical features and dense layers for numeric ones.
Note: Always be on the lookout for a new package appearing.
Note: Don’t expect a neural network to be the best model in a tabular competition; this seldom happens.
- Instead, blend solutions from classical tabular data models, such as gradient boosting models and neural networks, because they tend to pick up different signals from the data that you can integrate together in an ensemble.