t-SNE takes a high dimensional dataset and reduces it to a low dimensional graph that retains a lot of the original information.
How does t-SNE work?
Let’s see how t-SNE transforms this 2-D scatter plot into a 1-D plot on a number line.
Note: if we just projected the data onto one of the axes, we’d just get a big mess that doesn’t preserve the original clustering.
What t-SNE does is find a way to project data into a low dimensional space so that the clustering in the high dimensional space is preserved.
How doest t-SNE do that?
We’ll start with the original 2-D scatter plot.
Then, we’ll put the points on the number line in a random order.
From here on out, t-SNE moves these points, a little bit at a time, until it has clustered them.
How does t-SNE move data points?
Let’s start with the first data point on the left? How does t-SNE decide where to move it (right or left) and by how much?
- Because it is part of the “red” clusters, it wants to move closer to the other red dots on the number line.
- But, at the same time, “blue” and “yellow” points are far away in the scatter plot, so they push back.
- Basically, red points attract, and other points repel.
- In this case, the attraction is strongest, so the first red dot on the left moves a little to the right.
At each step, a point on the line is attracted to points it is near in the scatter plot, and repelled by points it is far from.
t-SNE Steps
Determine the “similarity” of all points in the scatter plot. How t-SNE determines similarity?
Measure the distance between two points.
Plot that distance on a normal curve that is centered on the point of interest.
Draw a line from the point to the curve. The length of that line is the "unscaled similarity".
Note: The width of the normal curve depends on the density of data near the point of interest. Less dense regions have wider curves.
Note: Using a normal distribution means that distant points have very low similarity values, and close points have high similarity values.
Once we measure the distance between all the points, we plot them on a normal curve and measure the distances from the points to the curve to get the unscaled similarity scores with respect to the point of interest.
Scale the unscaled similarities so they add up to 1.
Why is has to add up to 1? The difference in density of data points (which results in differences in the standard deviation of the normal curve) can result in similarities getting under/over-estimated (see image below). Similarity score won’t be much affected by the cluster’s density this way.
How to scale it?sum of all scoresscore=scaled score.
Perplexity parameter: In the t-SNE’s actual formulation, it has a “perplexity parameter” equal to the expected density around each point.
Note: Because the width of the normal distribution is based on the density of the surrounding data points, the similarity between the two points is not symmetrical (each direction might have a different similarity). t-SNE just averages the two similarity scores.
Ultimately, you end up with a matrix of similarity scores. Each row and column represents the similarity scores calculated from that point of interest.
Notes: t-SNE defines the similarity score of each point to itself (the diagonal) as zero since that similarity score doesn’t really help with the clustering.
Once, we’re done with calculating the similarity scores, we project data points into a lower-dimensional space, and calculate similarity scores for the points on the lower-dimensional space.
That means picking a point, measuring a distance, and lastly drawing a line from the point to a curve. However, this time we’re using a t-distribution.
t-Distribution is a lot like a normal distribution, except that the t-distribution is as tall in the middle, and the tails are taller (fatter) on the ends.
Why t-distribution? Without it, clusters will all clump up in the middle and be harder to see.
Using t-distribution we calculate unscaled similarity scores for all the points and then scale them like before. Like before, we end up with a matrix of similarity scores, but this matrix is a mess (see image below) compared to the original matrix.
The goal of moving points is to make two similarity matrices as similar as possible.
t-SNE uses small movements because it is a little like a chess game and can’t be solved all at once. It goes one move at a time.
t-SNE stands for t-distributed Stochastic Neighborhood Embedding.
t-SNE is a dimension reduction/data visualization method, proposed by Maaten & Hinton in 2008.
t-SNE tends to preserve local structure at the same time preserving the global structure as much as possible.
Note: Other methods like PCA or Multi-Dimensional Scaling (MDS) try to preserve the global structure and in that process they lose the local structure.
What is SNE?
In SNE the aim is to match distributions of distances between points in high and low dimensional space via conditional probabilities. SNE assumes distances in both high and low dimensional spaces are Gaussian-distributed.
Let xi be the ith object in high dimensional space and yi be the ith object in low dimensional space.
You construct the conditional probability of the similarity between points i and j, pj∣i, as follows,
Note: This looks like a kernel of normal distribution.
Note: If you have two data points that are very close to each other (i.e. ∣∣xi−xj∣∣ is very small), then their similarity are going to be closer to 1 (since x→0e→1), and visa versa for when the points are far apart.
qj∣i=k=i∑e−∣∣yi−yk∣∣2e−∣∣yi−yj∣∣2
qj∣i is the conditional probability similarity in the low dimensional space.
In qj∣i, it’s assumed that σ=21 (or 2σ2=1).
Note: It’s also assumed that pi∣i=qi∣i=0 (since self-similarity is not informative in this case).
The objective is to reduce the difference between pi∣i and qi∣i such that when you project the high dimensional data into low dimensional space, it looks as similar to high dimensional as possible.
This goal is achieved by minimizing the sum of Kullback-Leibler divergences:
C=i∑KL(Pi∣∣Qi)=i∑j∑pj∣ilog(qj∣ipj∣i)
Since KL divergence is asymmetric,
Large cost for representing nearby data points in high dimensional map by widely separated points in the low dimensional map.
Smaller cost for representing widely separated data points in high dimensional map by nearby points in the low dimension.
Hence the local structure is highly preserved.
σi is associated with a parameter called perplexity which can be loosely interpreted as the number of close neighbors each point has.
σi is found via binary search.
The gradient of cost function is,
∂yi∂C=2j∑(pj∣i−qj∣i+pi∣j−qi∣j)(yi−yj)
SNE uses gradient descent for optimization.
In case SNE, in addition to the gradient of the cost function it also has a momentum term to speed up the optimization and to avoid local optima.
y(t)=y(t−1)+η∂y∂C+α(t)(y(t−1)−y(t−2))
where
α(t): momentum at iteration t
y(t): solution at iteration t
η: learning rate
SNE has two main drawbacks:
Cost function is difficult to optimize
Crowding problem
If there are points are moderately far apart, and there are points are nearby, SNE tends to clump all of them together.
Novel features in t-SNE:
t-SNE cost function has two distinct features:
Cost function is a symmetrized version of that in SNE, i.e. (pj∣i=pi∣jqj∣i=qi∣j)
Student t-distribution is used to compute the similarities between data points in the low dimensional space.
The main feature in symmetric SNE is that pj∣i=pi∣j and pii=qii=0.
The gradient of cost function is,
∂yi∂C=4j∑(pji−qji)(yi−yj)
In t-SNE a student t-distribution with one degree of freedom (Cauchy distribution) is used to represent the low dimensional map:
qij=k=i∑(1+∣∣yk−yi∣∣2)−1(1+∣∣yi−yj∣∣2)−1
t-distribution is robust to outliers and unlike Gaussian distribution it doesn’t have exponent in it so faster to evaluate. The gradient of the cost function becomes,