Intuitive Explanation

t-SNE takes a high dimensional dataset and reduces it to a low dimensional graph that retains a lot of the original information.

How does t-SNE work?

Let’s see how t-SNE transforms this 2-D scatter plot into a 1-D plot on a number line.

Note: if we just projected the data onto one of the axes, we’d just get a big mess that doesn’t preserve the original clustering.

What t-SNE does is find a way to project data into a low dimensional space so that the clustering in the high dimensional space is preserved.

How doest t-SNE do that?

We’ll start with the original 2-D scatter plot.
Then, we’ll put the points on the number line in a random order.
From here on out, t-SNE moves these points, a little bit at a time, until it has clustered them.

How does t-SNE move data points?

Let’s start with the first data point on the left? How does t-SNE decide where to move it (right or left) and by how much?
- Because it is part of the “red” clusters, it wants to move closer to the other red dots on the number line.
- But, at the same time, “blue” and “yellow” points are far away in the scatter plot, so they push back.
- Basically, red points attract, and other points repel.
- In this case, the attraction is strongest, so the first red dot on the left moves a little to the right.

At each step, a point on the line is attracted to points it is near in the scatter plot, and repelled by points it is far from.

t-SNE Steps

Determine the “similarity” of all points in the scatter plot. How t-SNE determines similarity?
1. Measure the distance between two points.
2. Plot that distance on a normal curve that is centered on the point of interest.
3. Draw a line from the point to the curve. The length of that line is the "unscaled similarity".
4. Note: The width of the normal curve depends on the density of data near the point of interest. Less dense regions have wider curves.
5. Note: Using a normal distribution means that distant points have very low similarity values, and close points have high similarity values.

Once we measure the distance between all the points, we plot them on a normal curve and measure the distances from the points to the curve to get the unscaled similarity scores with respect to the point of interest.

Scale the unscaled similarities so they add up to 1.
- Why is has to add up to 1? The difference in density of data points (which results in differences in the standard deviation of the normal curve) can result in similarities getting under/over-estimated (see image below). Similarity score won’t be much affected by the cluster’s density this way.
- How to scale it? $\frac{\small{\text{score}}}{\small{\text{sum of all scores}}} = \text{scaled score}$ .
- Perplexity parameter: In the t-SNE’s actual formulation, it has a “perplexity parameter” equal to the expected density around each point.
- Note: Because the width of the normal distribution is based on the density of the surrounding data points, the similarity between the two points is not symmetrical (each direction might have a different similarity). t-SNE just averages the two similarity scores.

Ultimately, you end up with a matrix of similarity scores. Each row and column represents the similarity scores calculated from that point of interest.
- Notes: t-SNE defines the similarity score of each point to itself (the diagonal) as zero since that similarity score doesn’t really help with the clustering.

Once, we’re done with calculating the similarity scores, we project data points into a lower-dimensional space, and calculate similarity scores for the points on the lower-dimensional space.
- That means picking a point, measuring a distance, and lastly drawing a line from the point to a curve. However, this time we’re using a t-distribution.
- t-Distribution is a lot like a normal distribution, except that the t-distribution is as tall in the middle, and the tails are taller (fatter) on the ends.
- Why t-distribution? Without it, clusters will all clump up in the middle and be harder to see.
Using t-distribution we calculate unscaled similarity scores for all the points and then scale them like before. Like before, we end up with a matrix of similarity scores, but this matrix is a mess (see image below) compared to the original matrix.
- The goal of moving points is to make two similarity matrices as similar as possible.
- t-SNE uses small movements because it is a little like a chess game and can’t be solved all at once. It goes one move at a time.

Formal Explanation

Source.

t-SNE stands for t-distributed Stochastic Neighborhood Embedding.
t-SNE is a dimension reduction/data visualization method, proposed by Maaten & Hinton in 2008.
t-SNE tends to preserve local structure at the same time preserving the global structure as much as possible.
- Note: Other methods like PCA or Multi-Dimensional Scaling (MDS) try to preserve the global structure and in that process they lose the local structure.

What is SNE?

In SNE the aim is to match distributions of distances between points in high and low dimensional space via conditional probabilities. SNE assumes distances in both high and low dimensional spaces are Gaussian-distributed.

Let $x_i$ be the $i^{th}$ object in high dimensional space and $y_i$ be the $i^{th}$ object in low dimensional space.
You construct the conditional probability of the similarity between points $i$ and $j$ , $p_{j|i}$ , as follows,

$p_{j|i} = \frac{e^{\large{{-\frac{||x_i - x_j||^2}{2\sigma^2_i}}}}}{\sum\limits_{k \neq i}{e^{\large{{-\frac{||x_i - x_k||^2}{2\sigma^2_i}}}}}}$

Note: This looks like a kernel of normal distribution.
Note: If you have two data points that are very close to each other (i.e. $||x_i - x_j||$ is very small), then their similarity are going to be closer to $1$ (since $\underset{x \rightarrow 0}{\large{e}} \rightarrow 1$ ), and visa versa for when the points are far apart.

$q_{j|i} = \frac{e^{-||y_i - y_j||^2}}{\sum\limits_{k \neq i}{e^{-||y_i - y_k||^2}}}$

$q_{j|i}$ is the conditional probability similarity in the low dimensional space.
In $q_{j|i}$ , it’s assumed that $\sigma = \frac{1}{\sqrt{2}}$ (or $2\sigma^2 = 1$ ).
Note: It’s also assumed that $p_{i|i} = q_{i|i} = 0$ (since self-similarity is not informative in this case).

The objective is to reduce the difference between $p_{i|i}$ and $q_{i|i}$ such that when you project the high dimensional data into low dimensional space, it looks as similar to high dimensional as possible.

This goal is achieved by minimizing the sum of Kullback-Leibler divergences:

$C = \sum\limits_i KL(P_i||Q_i) = \sum\limits_i \sum\limits_j p_{j|i} \log(\frac{p_{j|i}}{q_{j|i}})$

Since KL divergence is asymmetric,

Large cost for representing nearby data points in high dimensional map by widely separated points in the low dimensional map.
Smaller cost for representing widely separated data points in high dimensional map by nearby points in the low dimension.

Hence the local structure is highly preserved.

$\sigma_i$ is associated with a parameter called perplexity which can be loosely interpreted as the number of close neighbors each point has.

$\sigma_i$ is found via binary search.

The gradient of cost function is,

$\frac{\partial C}{\partial y_i} = 2 \sum\limits_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)$

SNE uses gradient descent for optimization.
In case SNE, in addition to the gradient of the cost function it also has a momentum term to speed up the optimization and to avoid local optima.

$y^{(t)} = y^{(t-1)} + \eta \frac{\partial C}{\partial y} + \alpha(t)(y^{(t-1)} - y^{(t-2)})$

where

$\alpha(t)$ : momentum at iteration $t$
$y^{(t)}$ : solution at iteration $t$
$\eta$ : learning rate

SNE has two main drawbacks:

Cost function is difficult to optimize
Crowding problem
- If there are points are moderately far apart, and there are points are nearby, SNE tends to clump all of them together.

Novel features in t-SNE:

t-SNE cost function has two distinct features:

Cost function is a symmetrized version of that in SNE, i.e. ( $p_{j|i} = p_{i|j} \quad q_{j|i} = q_{i|j}$ )
Student t-distribution is used to compute the similarities between data points in the low dimensional space.

The main feature in symmetric SNE is that $p_{j|i} = p_{i|j}$ and $p_{ii} = q_{ii} = 0$ .

The gradient of cost function is,

$\frac{\partial C}{\partial y_i} = 4 \sum\limits_j (p_{ji} - q_{ji})(y_i - y_j)$

In t-SNE a student t-distribution with one degree of freedom (Cauchy distribution) is used to represent the low dimensional map:

$q_{ij} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum\limits_{k \neq i}(1 + ||y_k - y_i||^2)^{-1}}$

t-distribution is robust to outliers and unlike Gaussian distribution it doesn’t have exponent in it so faster to evaluate. The gradient of the cost function becomes,

$\frac{\partial C}{\partial y_i} = 4 \sum\limits_j (p_{ji} - q_{ji})(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}$

t-SNE Algorithm

The general idea of t-SNE algorithm:

Data: data set $\mathbb{X}$ with data points $\{x_1, x_2, \dots, x_n\}$ with each of these points have very high dimension
Cost function parameter, Perplexity (Perp): which is associated with variance ( $\sigma$ ) in the cost function.
Optimization parameters: number of iterations ( $T$ ), learning rate ( $\eta$ ), momentum $\alpha(t)$ .

BEGIN

Compute pairwise affinities $p_{j|i}$ with perplexity Perp
Set $p_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}$ , where $n$ = number of data points
Sample initial solution $Y^{(0)} = y_1, y_2, \dots, y_n$ from $N(0, 10^{-4}I)$
- For $t=1$ to $T$ :
  - Compute low-dimensional affinities, $q_{ij}$
  - Compute gradient $\frac{\partial C}{\partial y}$
  - Set $y^{(t)} = y^{(t-1)} + \eta \frac{\partial C}{\partial y} + \alpha(t)(y^{(t-1)} - y^{(t-2)})$
- END
END

Criticisms of t-SNE

There is a great paper by Wattenberg, et. al “How to Use t-SNE Effectively”, that explains some of the drawbacks of t-SNE:

Different perplexity can lead to completely different clusters (too small $\rightarrow$ location variations dominate, too large $\rightarrow$ global change dominate).
Cluster size doesn’t have any meaning to it. Naturally expands dense cluster and contracts sparse clusters.
Distance between clusters might not have clear interpretation.
Sometimes random noise can lead to false positive structure in the t-SNE projection.

Summary

t-SNE is an incredibly successful tool for clustering and data visualization.
It provides better structure for very high dimensional data
However, higher flexibility leads to other drawbacks like lack of interpretability.
Not very intuitive to tune the parameters (perplexity, iterations, tolerance, etc.)