
Building and evaluating a random forest

Step 1

Create a “bootstrapped” dataset.
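
As a minimal sketch of this step, assuming the dataset is a plain list of rows: a bootstrapped dataset has the same number of rows as the original, drawn at random with replacement, so some rows can appear more than once and others not at all. The function name `bootstrap_sample` and the example rows are made up for illustration.

```python
import random

def bootstrap_sample(rows, rng):
    """Return a bootstrapped copy of `rows` (hypothetical helper)."""
    # Draw len(rows) indices with replacement, so a row can repeat.
    indices = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in indices]

# Made-up rows: (heart disease label, weight).
rows = [("yes", 180), ("no", 125), ("yes", 210), ("no", 167)]
sample = bootstrap_sample(rows, random.Random(0))
```

Passing in a seeded `random.Random` keeps the draw reproducible, which is handy when comparing forests.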

Step 2

Create a decision tree using the bootstrapped dataset, but only use a random subset of variables (or columns) at each step.
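
The "random subset of variables at each step" can be sketched like this, assuming columns are identified by name: every time the tree is about to split, a fresh handful of columns is drawn, and only those are considered as candidates for that split. The helper `candidate_columns` and the column names are hypothetical.

```python
import random

def candidate_columns(columns, n_candidates, rng):
    """Pick the columns a single split is allowed to consider."""
    # Sample without replacement so no column is offered twice at one split.
    return rng.sample(columns, n_candidates)

# Hypothetical column names, for illustration only.
columns = ["chest_pain", "blood_circulation", "blocked_arteries", "weight"]
rng = random.Random(1)
first_split = candidate_columns(columns, 2, rng)   # candidates for the root
second_split = candidate_columns(columns, 2, rng)  # redrawn for the next node
```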

Now go back to step 1 and repeat

Make a new bootstrapped dataset and build a tree considering a subset of variables at each step.

Now that we’ve created a random forest, how do we use it?

Estimate accuracy: How do we know if it’s any good?

Choosing the number of subset variables

Missing data and sample clustering

Random forests consider two types of missing data:

  1. Missing data in the original dataset used to create the random forest.
  2. Missing data in a new sample that you want to categorize.

Missing data in the original dataset

The general idea for dealing with missing data in this context is to make an initial guess that could be bad, then gradually refine the guess until it is (hopefully) a good guess.

The initial guess

For each row with missing values, we make an initial guess for every missing column using the other rows that share that row’s label: the most common value for a categorical column, or the median for a continuous one.
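
A minimal sketch of the initial guess, assuming rows are dictionaries and a missing value is stored as `None`. The function name, the column names, and the data are all made up for illustration, and the sketch assumes at least one row with the same label has a known value.

```python
from collections import Counter
from statistics import median

def initial_guess(rows, column, label_column, categorical):
    """Fill None entries in `column` from rows that share the same label."""
    filled = [dict(r) for r in rows]  # copy so the input stays untouched
    for row in filled:
        if row[column] is not None:
            continue
        # Known values of this column among rows with the same label.
        known = [r[column] for r in rows
                 if r[label_column] == row[label_column] and r[column] is not None]
        if categorical:
            row[column] = Counter(known).most_common(1)[0][0]  # most common value
        else:
            row[column] = median(known)
    return filled

# Made-up data: the last row is missing both values.
rows = [
    {"blocked": "no", "weight": 125, "disease": "no"},
    {"blocked": "yes", "weight": 180, "disease": "yes"},
    {"blocked": "no", "weight": 210, "disease": "no"},
    {"blocked": None, "weight": None, "disease": "no"},
]
step1 = initial_guess(rows, "blocked", "disease", categorical=True)
filled = initial_guess(step1, "weight", "disease", categorical=False)
```

Here the last row's `blocked` becomes `"no"` (the most common value among the "no" rows) and its `weight` becomes the median of 125 and 210.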

Screen Shot 2021-12-11 at 5.54.54 PM

Refine initial guess

We refine the guess by first determining which samples are similar to the one with missing data.

How to determine similarity?

  1. Build a random forest.
  2. Run all of the data down all of the trees.
  3. For each tree:
    • If two samples (rows of data) end up in the same leaf node, they’re similar.
    • For every pair of similar samples, we add 1 to that pair’s entry in the proximity matrix.
    • Once the data have been run down all of the trees, the proximity matrix is filled in.
  4. Then, we divide each proximity value by the total number of trees.
  5. Now, we replace the missing values using the proximity values.
    • Here, proximity values are used as weights to calculate the missing value.
    • Note: we normalize each weight by dividing it by the sum of all proximity values for that sample (see examples below for categorical and numeric values).
    • Note: for categorical values, we calculate the weighted frequency of each value and choose the value with the highest weighted frequency (see image below).
    • Note: for numeric values, it’s the weighted average of the non-missing values, where the weights are the proximity values (see below).
  6. We do the whole thing over again (i.e. 1. build a random forest, 2. run the data through the trees, 3. recalculate the proximity and missing values) until the (guessed) missing values converge.
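
Steps 2 to 4 above can be sketched as follows, assuming we already know which leaf each sample lands in for each tree. The input layout `leaf_ids[t][i]` (tree `t`, sample `i`) and the leaf numbers are assumptions made for this example. The diagonal is left at 0 here because a sample is not compared with itself; that is a choice of this sketch.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[t][i] is the leaf that sample i reaches in tree t."""
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    counts = [[0] * n_samples for _ in range(n_samples)]
    for tree in leaf_ids:
        for i in range(n_samples):
            for j in range(n_samples):
                # Same leaf in this tree: the pair counts as similar.
                if i != j and tree[i] == tree[j]:
                    counts[i][j] += 1
    # Divide by the number of trees so every value lies between 0 and 1.
    return [[c / n_trees for c in row] for row in counts]

# Made-up leaf assignments for 3 trees and 4 samples.
leaf_ids = [
    [1, 1, 2, 2],
    [3, 4, 4, 4],
    [5, 5, 5, 6],
]
prox = proximity_matrix(leaf_ids)
```

With scikit-learn, the `apply` method of a fitted forest returns leaf indices per sample and per tree, which could be transposed into this layout; that pairing is a suggestion, not something these notes prescribe.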
Screen Shot 2021-12-13 at 12.35.29 PM
Screen Shot 2021-12-13 at 12.33.03 PM
Screen Shot 2021-12-13 at 1.17.20 PM
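
The proximity-weighted replacement can be sketched as below, assuming `None` marks a missing value and `proximities` holds one row of the proximity matrix (the row of the sample being filled in). The function names and numbers are made up. For categorical values, "weighted frequency" is read here as the value's plain frequency multiplied by its share of the total proximity; that reading is an assumption of this sketch.

```python
from collections import Counter, defaultdict

def refine_numeric(values, proximities):
    """Proximity-weighted average of the known values."""
    known = [(v, p) for v, p in zip(values, proximities) if v is not None]
    total = sum(p for _, p in known)  # normalizes the weights
    return sum(v * p for v, p in known) / total

def refine_categorical(values, proximities):
    """Value with the highest weighted frequency among the known values."""
    known = [(v, p) for v, p in zip(values, proximities) if v is not None]
    total = sum(p for _, p in known)
    counts = Counter(v for v, _ in known)
    weight = defaultdict(float)
    for v, p in known:
        weight[v] += p / total  # this value's share of the proximity
    # Assumed reading: plain frequency times proximity share.
    scores = {v: counts[v] / len(known) * weight[v] for v in counts}
    return max(scores, key=scores.get)

# Made-up example: the fourth sample has the missing values.
proximities = [0.1, 0.1, 0.8, 0.0]
new_weight = refine_numeric([125, 180, 210, None], proximities)
new_blocked = refine_categorical(["no", "yes", "no", None], proximities)
```

In this example the third sample dominates because its proximity is 0.8, so the refined weight lands close to 210 and the refined category is `"no"`.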

Distance matrix

We can do something cool with the proximity matrix. When two samples end up in the same leaf node for all the trees, their (normalized) proximity score is 1, which means the samples are as close as can be. That means $1 - \text{proximity} = \text{distance}$ (i.e. close as can be = no distance between). We can draw a heatmap based on a distance matrix. We can also draw an MDS plot.

The cool thing about this is that no matter what the data are (ranks, multiple-choice, numeric, etc.), if we can use it to make a tree, we can draw a heatmap (or an MDS plot) to show how the samples are related to each other.
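
As a small sketch of the conversion, assuming the proximity matrix is a list of lists with values between 0 and 1: each distance is 1 minus the proximity, and the diagonal is set to 0 because a sample has no distance to itself. The function name and numbers are made up.

```python
def distance_matrix(proximity):
    """Turn normalized proximities into distances (1 - proximity)."""
    n = len(proximity)
    return [
        [0.0 if i == j else 1.0 - proximity[i][j] for j in range(n)]
        for i in range(n)
    ]

# Made-up proximities: samples 0 and 1 share a leaf in every tree.
proximity = [
    [0.0, 1.0, 0.25],
    [1.0, 0.0, 0.5],
    [0.25, 0.5, 0.0],
]
dist = distance_matrix(proximity)
```

A matrix like `dist` is the kind of input a heatmap or an MDS routine would take.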

Screen Shot 2021-12-13 at 2.53.14 PM

Missing data in a new sample

In this case, since we don’t have the labels, we have to do this: