Intuitive Explanation

t-SNE takes a high dimensional dataset and reduces it to a low dimensional graph that retains a lot of the original information.

How does t-SNE work?

Let’s see how t-SNE transforms this 2-D scatter plot into a 1-D plot on a number line.

Screen Shot 2022-01-24 at 11.29.01 AM

Note: if we just projected the data onto one of the axes, we’d just get a big mess that doesn’t preserve the original clustering.

Screen Shot 2022-01-24 at 11.32.48 AM

What t-SNE does is find a way to project data into a low dimensional space so that the clustering in the high dimensional space is preserved.

How doest t-SNE do that?

  1. We’ll start with the original 2-D scatter plot.
  2. Then, we’ll put the points on the number line in a random order.
  3. From here on out, t-SNE moves these points, a little bit at a time, until it has clustered them.
Screen Shot 2022-01-24 at 11.38.33 AM

How does t-SNE move data points?

Let’s start with the first data point on the left? How does t-SNE decide where to move it (right or left) and by how much?
- Because it is part of the “red” clusters, it wants to move closer to the other red dots on the number line.
- But, at the same time, “blue” and “yellow” points are far away in the scatter plot, so they push back.
- Basically, red points attract, and other points repel.
- In this case, the attraction is strongest, so the first red dot on the left moves a little to the right.

At each step, a point on the line is attracted to points it is near in the scatter plot, and repelled by points it is far from.

t-SNE Steps

  1. Determine the “similarity” of all points in the scatter plot. How t-SNE determines similarity?
    1. Measure the distance between two points.
    2. Plot that distance on a normal curve that is centered on the point of interest.
    3. Draw a line from the point to the curve. The length of that line is the "unscaled similarity".
    4. Note: The width of the normal curve depends on the density of data near the point of interest. Less dense regions have wider curves.
    5. Note: Using a normal distribution means that distant points have very low similarity values, and close points have high similarity values.
Screen Shot 2022-01-24 at 11.51.50 AM
  1. Once we measure the distance between all the points, we plot them on a normal curve and measure the distances from the points to the curve to get the unscaled similarity scores with respect to the point of interest.
Screen Shot 2022-01-24 at 11.55.09 AM
  1. Scale the unscaled similarities so they add up to 1.
    • Why is has to add up to 1? The difference in density of data points (which results in differences in the standard deviation of the normal curve) can result in similarities getting under/over-estimated (see image below). Similarity score won’t be much affected by the cluster’s density this way.
    • How to scale it? scoresum of all scores=scaled score\frac{\small{\text{score}}}{\small{\text{sum of all scores}}} = \text{scaled score}.
    • Perplexity parameter: In the t-SNE’s actual formulation, it has a “perplexity parameter” equal to the expected density around each point.
    • Note: Because the width of the normal distribution is based on the density of the surrounding data points, the similarity between the two points is not symmetrical (each direction might have a different similarity). t-SNE just averages the two similarity scores.
Screen Shot 2022-01-24 at 12.06.27 PM
  1. Ultimately, you end up with a matrix of similarity scores. Each row and column represents the similarity scores calculated from that point of interest.
    • Notes: t-SNE defines the similarity score of each point to itself (the diagonal) as zero since that similarity score doesn’t really help with the clustering.
Screen Shot 2022-01-24 at 12.15.33 PM
  1. Once, we’re done with calculating the similarity scores, we project data points into a lower-dimensional space, and calculate similarity scores for the points on the lower-dimensional space.
    • That means picking a point, measuring a distance, and lastly drawing a line from the point to a curve. However, this time we’re using a t-distribution.
    • t-Distribution is a lot like a normal distribution, except that the t-distribution is as tall in the middle, and the tails are taller (fatter) on the ends.
    • Why t-distribution? Without it, clusters will all clump up in the middle and be harder to see.
  2. Using t-distribution we calculate unscaled similarity scores for all the points and then scale them like before. Like before, we end up with a matrix of similarity scores, but this matrix is a mess (see image below) compared to the original matrix.
    • The goal of moving points is to make two similarity matrices as similar as possible.
    • t-SNE uses small movements because it is a little like a chess game and can’t be solved all at once. It goes one move at a time.
Screen Shot 2022-01-24 at 12.24.39 PM

Formal Explanation

Source.

What is SNE?

In SNE the aim is to match distributions of distances between points in high and low dimensional space via conditional probabilities. SNE assumes distances in both high and low dimensional spaces are Gaussian-distributed.

pji=exixj22σi2kiexixk22σi2p_{j|i} = \frac{e^{\large{{-\frac{||x_i - x_j||^2}{2\sigma^2_i}}}}}{\sum\limits_{k \neq i}{e^{\large{{-\frac{||x_i - x_k||^2}{2\sigma^2_i}}}}}}

qji=eyiyj2kieyiyk2q_{j|i} = \frac{e^{-||y_i - y_j||^2}}{\sum\limits_{k \neq i}{e^{-||y_i - y_k||^2}}}

The objective is to reduce the difference between piip_{i|i} and qiiq_{i|i} such that when you project the high dimensional data into low dimensional space, it looks as similar to high dimensional as possible.

C=iKL(PiQi)=ijpjilog(pjiqji)C = \sum\limits_i KL(P_i||Q_i) = \sum\limits_i \sum\limits_j p_{j|i} \log(\frac{p_{j|i}}{q_{j|i}})

Since KL divergence is asymmetric,

Hence the local structure is highly preserved.

σi\sigma_i is associated with a parameter called perplexity which can be loosely interpreted as the number of close neighbors each point has.

The gradient of cost function is,

Cyi=2j(pjiqji+pijqij)(yiyj)\frac{\partial C}{\partial y_i} = 2 \sum\limits_j (p_{j|i} - q_{j|i} + p_{i|j} - q_{i|j})(y_i - y_j)

y(t)=y(t1)+ηCy+α(t)(y(t1)y(t2))y^{(t)} = y^{(t-1)} + \eta \frac{\partial C}{\partial y} + \alpha(t)(y^{(t-1)} - y^{(t-2)})

where

SNE has two main drawbacks:

  1. Cost function is difficult to optimize
  2. Crowding problem
    • If there are points are moderately far apart, and there are points are nearby, SNE tends to clump all of them together.

Novel features in t-SNE:

t-SNE cost function has two distinct features:

The main feature in symmetric SNE is that pji=pijp_{j|i} = p_{i|j} and pii=qii=0p_{ii} = q_{ii} = 0.

The gradient of cost function is,

Cyi=4j(pjiqji)(yiyj)\frac{\partial C}{\partial y_i} = 4 \sum\limits_j (p_{ji} - q_{ji})(y_i - y_j)

In t-SNE a student t-distribution with one degree of freedom (Cauchy distribution) is used to represent the low dimensional map:

qij=(1+yiyj2)1ki(1+ykyi2)1q_{ij} = \frac{(1 + ||y_i - y_j||^2)^{-1}}{\sum\limits_{k \neq i}(1 + ||y_k - y_i||^2)^{-1}}

t-distribution is robust to outliers and unlike Gaussian distribution it doesn’t have exponent in it so faster to evaluate. The gradient of the cost function becomes,

Cyi=4j(pjiqji)(yiyj)(1+yiyj2)1\frac{\partial C}{\partial y_i} = 4 \sum\limits_j (p_{ji} - q_{ji})(y_i - y_j)(1 + ||y_i - y_j||^2)^{-1}

t-SNE Algorithm

The general idea of t-SNE algorithm:

BEGIN

  1. Compute pairwise affinities pjip_{j|i} with perplexity Perp
  2. Set pij=pij+pji2np_{ij} = \frac{p_{i|j} + p_{j|i}}{2n}, where nn = number of data points
  3. Sample initial solution Y(0)=y1,y2,,ynY^{(0)} = y_1, y_2, \dots, y_n from N(0,104I)N(0, 10^{-4}I)
    • For t=1t=1 to TT:
      • Compute low-dimensional affinities, qijq_{ij}
      • Compute gradient Cy\frac{\partial C}{\partial y}
      • Set y(t)=y(t1)+ηCy+α(t)(y(t1)y(t2))y^{(t)} = y^{(t-1)} + \eta \frac{\partial C}{\partial y} + \alpha(t)(y^{(t-1)} - y^{(t-2)})
    • END
  4. END

Criticisms of t-SNE

There is a great paper by Wattenberg, et. al How to Use t-SNE Effectively, that explains some of the drawbacks of t-SNE:

Summary