CTGAN

Reproducability

Setting random state for reproducability

def seed_everything(seed, 
                    tensorflow_init=True, 
                    pytorch_init=True):
    """
    Seeds basic parameters for reproducibility of results
    """
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    np.random.seed(seed)
    if tensorflow_init is True:
        tf.random.set_seed(seed)
    if pytorch_init is True:
        torch.manual_seed(seed)
        torch.cuda.manual_seed(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

EDA

In an EDA, you’ll look for:

To simplify things and save some coding time, you could also use EDA tools:

Always make sure to do your own EDA: Remember that EDA stops being a commodity and becomes an asset for the competition when it is highly specific to the problem at hand; this is something that you will never find from automated solutions and seldom in public Notebooks. You have to do your EDA by yourself and gather key, winning insights.

Dimensionality reduction

Make sure to always consider using these dimensionality reduction techniques. They can be pretty helpful in identifying outliers and presence of relevant clusters in the data.

Implementation tips:
- How to use t-SNE effectively? Read the article here
- Understanding UMAP: Read the article here

Reduing data size, avoiding out-of-memory error

def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 
                'float16', 'float32', 'float64']'
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
	                df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df        

Note: Our suggestion is to apply it after feature engineering or before major transformations that do not rescale your existing data.

Note: Combining this with garbage collection library gc and the gc.collect() will improve the memory situation.

Note: Another way to reduce data size, it feature engineering and feature selection.

Feature engineering

Easily derived features

Here are the most common transformations to try:

cycle = 7
df['weekday_sin'] = np.sin(2 * np.pi * df['col1'].dt.dayofweek / cycle)
df['weekday_cos'] = np.cos(2 * np.pi * df['col1'].dt.dayofweek / cycle)

Note: All these data transformations can add predictive performance to your models, but they are seldom decisive in a competition.

Meta-features based on rows and columns

Target encoding

From: PetFinder.my Adoption Prediction

import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
class TargetEncode(BaseEstimator, TransformerMixin):
    
    def __init__(self, categories='auto', k=1, f=1, 
                 noise_level=0, random_state=None):
        if type(categories)==str and categories!='auto':
            self.categories = [categories]
        else:
            self.categories = categories
        self.k = k
        self.f = f
        self.noise_level = noise_level
        self.encodings = dict()
        self.prior = None
        self.random_state = random_state
        
    def add_noise(self, series, noise_level):
        return series * (1 + noise_level *   
                         np.random.randn(len(series)))
        
    def fit(self, X, y=None):
		if type(self.categories)=='auto':
            self.categories = np.where(X.dtypes == type(object()))[0]
        temp = X.loc[:, self.categories].copy()
        temp['target'] = y
        self.prior = np.mean(y)
        for variable in self.categories:
            avg = (temp.groupby(by=variable)['target']
                       .agg(['mean', 'count']))
            # Compute smoothing 
            smoothing = (1 / (1 + np.exp(-(avg['count'] - self.k) /                 
                         self.f)))
            # The bigger the count the less full_avg is accounted
            self.encodings[variable] = dict(self.prior * (1 -  
                             smoothing) + avg['mean'] * smoothing)
            
        return self
    
    def transform(self, X):
        Xt = X.copy()
        for variable in self.categories:
            Xt[variable].replace(self.encodings[variable], 
                                 inplace=True)
            unknown_value = {value:self.prior for value in 
                             X[variable].unique() 
                             if value not in 
                             self.encodings[variable].keys()}
            if len(unknown_value) > 0:
                Xt[variable].replace(unknown_value, inplace=True)
            Xt[variable] = Xt[variable].astype(float)
            if self.noise_level > 0:
                if self.random_state is not None:
	                np.random.seed(self.random_state)
                Xt[variable] = self.add_noise(Xt[variable], 
                                              self.noise_level)
        return Xt
    
    def fit_transform(self, X, y=None):
        self.fit(X, y)
        return self.transform(X)

# How to use the class
te = TargetEncode(categories='ROLE_TITLE')
te.fit(train, train['ACTION'])
te.transform(train[['ROLE_TITLE']])

The input parameters of the function are:

Note: Instead writing your own code, you could also use this library: category_encoders and its Target Encoder

Using feature importance

How to select features?

Pseudo-labeling

A good example of the complete procedure.

Denoising with autoencoders

Important factors to keep an eye on when working with DAEs:

Examples of recent DAE implementations

==Note: If you don’t want to spend too much time building your own DAE, but you would like to explore whether something like it could work for the competition you are taking on, you can test out a couple of pre-prepared solutions. First, you can refer to this notebook, and re-adapt it to your needs.

Neural networks for tabular competitions

The key things to take into account when building these solutions are:

Out-of-the-box solutions
If you don’t want to build your own deep neural network in TensorFlow or PyTorch, you can rely on a few out-of-the-box architectural solutions. Here are the main ones you can try when taking on a tabular competition yourself: