Building and evaluating random forest

Random forest is made out of decision trees.
Decision trees are easy to build and interpret.
- But, in practice, they’re not that good.
- Trees have one aspect that prevents them from being the ideal tool for predictive learning, namely inaccuracy.
- In other words, they’re great with the data used to create them, but they’re not flexible when classifying new samples.
Random forest combines the simplicity of decision trees with flexibility resulting in a vast improvement in accuracy.

Step 1

Create a “bootstrapped” dataset.

To create a bootstrapped dataset that is the same size as the original, we just randomly select samples from the original dataset.
The important detail is that we’re allowed to pick the same sample more than once.

Step 2

Create a decision tree using the bootstrapped dataset, but only use a random subset of variables (or columns) at each step.

Note: We’ll discuss how to find the optimal number of variables to consider a little later.
We first pick, say 3, variables and find the best one to be the root node.
After selecting the root node variable, we leave that out, and pick another 3 from the rest of the variables, and we basically build the tree like that.
Note that in each step, we only use a subset of variables instead of all.

Now go back to step 1 and repeat

Make a new bootstrapped dataset and build a tree considering a subset of variables at each step.

Ideally, you’d do this 100’s of times.
Using a bootstrapped sample and considering only a subset of the variables at each step results in a wide variety of trees.
The variety is what makes the random forest more effective than individual trees.

Now that we’ve created a random forest, how do we use it?

We run each row of the data down all of the trees in the random forest, we see which option received more votes.
Bootstrapping the data plus using the aggregate to make a decision is called “Bagging”.

Estimate accuracy: How do we know if it’s any good?

When we created the bootstrapped dataset, we allowed duplicate entries. As a result, some entries may not get included in the bootstrapped dataset.
Typically, about 1/3 of the original data does not end up in the bootstrapped dataset. This is called the “Out-of-Bag Dataset”.
We run all the out-of-bag data through all the trees and find the votes for them.
Ultimately, we can measure how accurate our random forest is by the proportion of out-of-bag samples that were correctly classified by the random forest.
The proportion of out-of-bag samples that were incorrectly classified is the “out-of-bag error”.

Choosing the number of subset variables

Now that we know how to measure the accuracy of a random forest, we can build random forests with the different number of variables subsets and choose the number that yields the highest accuracy.
Typically, we start by using the square of the number of variables and then try a few settings above and below that value.

Missing data and sample clustering

Random forest consider two types of missing data:

Missing data in the original dataset used to create the random forest.
Missing data in a new sample that you want to categorize.

Missing data in the original dataset

The general idea for dealing with missing data in this context is to make an initial guess that could be bad, then gradually refine the guess until it is (hopefully) a good guess.

The initial guess

Based on the label for that row of data that has missing values, we make initial guesses for the columns with missing values based on the most common value (categorical) or median (for continuous) of the non-missing data with the same label.

Refine initial guess

We do this by first determining which samples are similar to the ones with missing data.

How to determine similarity?

Build a random forest.
Run all of the data down all of the trees
For each tree:
- If two samples (rows of data) end up in the same leaf node, they’re similar.
- For each tree, we find similar samples and add +1 to samples that are similar.
- Ultimately, we run the data down all the trees and the proximity matrix fills in.
Then, we divide each proximity value by the total number of trees.
Now, we replace the missing values using the proximity values.
- Here, proximity numbers are used as weights to calculate the missing value.
- Note: we’ll normalize the value by diving it by the sum of all proximity values for that sample. (see examples below for categorical and numeric values)
- Note: for categorical values we calculate the weight for each value and choose the one with the highest weighted frequency (see image below).
- Note: For numeric values, it’s just the weighted sum of non-missing values where the weights are proximity values (see below).
We do the whole thing over again (i.e. 1. Build a random forest, 2. Run the data through the trees, 3. Recalculate the proximity and missing values), until the (guessed) missing values converge.

Distance matrix

We can do something cool with the proximity matrix. When two samples end up in the same leaf node for all the trees, their (normalized) proximity score is 1, which means the samples are as close as can be. That means $1 - \text{proximity} = \text{distance}$ (i.e. close as can be = no distance between). We can draw a heatmap based on a distance matrix. We can also draw an MDS plot.

The cool thing about this is that no matter what the data are (ranks, multiple-choice, numeric, etc.), if we can use it to make a tree, we can draw a heatmap (or an MDS plot) to show how the samples are related to each other.

Missing data in a new sample

In this case, since we don’t have the labels, we have to do this:

The first thing is to create two (imagine binary categories) copies of the data with each of the labels.
Then, we use the iterative method from above to make a good guess about the missing values.
Once we filled out the missing values, we run the two samples down the random forests to see which of the two are correctly labeled by the random forest the most times.
- The one with more correctly classified instances wins.