Let’s start with a simple example to build intuition for PCA. Say we have the following dataset of two gene measurements for six mice:
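|        | Mouse 1 | Mouse 2 | Mouse 3 | Mouse 4 | Mouse 5 | Mouse 6 |
|--------|---------|---------|---------|---------|---------|---------|
| Gene 1 | 10      | 11      | 8       | 3       | 2       | 1       |
| Gene 2 | 6       | 4       | 5       | 3       | 2.8     | 1       |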
If we measure only one gene, we can plot the data on a number line and see that mice 1-3 have relatively higher values than mice 4-6. We can then add the Gene 2 measurements and plot a 2D graph.
PCA for 2-dimensional data
We start by plotting the data above.
Then, we calculate the average measurement for Gene 1 and the average measurement for Gene 2. Together, these two averages give us the center of the data.
Now, we’ll shift the data so that the center is on top of the origin (0,0) in the graph.
Note: Shifting the data did not change how the data points are positioned relative to each other.
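The centering step can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the six mice are stored as rows and the two genes as columns; the helper `pairwise_distances` is a hypothetical name introduced here only to check that shifting leaves the relative positions unchanged.

```python
import numpy as np

# Rows are mice 1-6; columns are Gene 1 and Gene 2 (values from the table above).
data = np.array([
    [10, 6],
    [11, 4],
    [8, 5],
    [3, 3],
    [2, 2.8],
    [1, 1],
], dtype=float)

# The center of the data is the per-gene average.
center = data.mean(axis=0)

# Shift every point so that the center lands on the origin (0, 0).
centered = data - center

def pairwise_distances(points):
    """Distance between every pair of points (hypothetical helper for the check below)."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))

print("center:", center)
print("mean after centering:", centered.mean(axis=0))
print("relative positions unchanged:",
      np.allclose(pairwise_distances(data), pairwise_distances(centered)))
```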
Now that the data is centered on the origin, we can try to fit a line to it. To do this:
We start by drawing a random line that goes through the origin.
Then, we rotate the line until it fits the data as well as it can, given that it has to go through the origin.
Ultimately, we find the line that fits best. But how does PCA decide whether a fit is good or not?
To quantify the fit, PCA projects the data onto the line. It can then either measure the distances from the data points to the line and look for the line that minimizes those distances, or it can look for the line that maximizes the distances from the projected points to the origin. These two approaches are equivalent.
To understand this better, we can look at one point’s projection onto the line. Let a be the distance from the point to the origin, b the distance from the point to the line, and c the distance from the projected point to the origin. These three lengths form a right triangle, so $a^2 = b^2 + c^2$. Rotating the line does not change a, so only b and c can change, and they are inversely related: if b gets bigger, c must get smaller, and vice versa. Thus, PCA can either minimize the distance to the line (b) or maximize the distance from the projected point to the origin (c).
Intuitively, it makes sense to minimize b, but it’s actually easier to calculate c, so PCA finds the best-fitting line by maximizing the sum of the squared distances from the projected points to the origin.
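The right-triangle relationship can be checked numerically. The sketch below uses a made-up centered point and a hypothetical helper, `triangle_sides`, that returns the three lengths for a line through the origin at a given angle; neither comes from the original example.

```python
import numpy as np

def triangle_sides(point, angle_deg):
    """Return (a, b, c) for one point and a line through the origin at angle_deg."""
    theta = np.radians(angle_deg)
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit vector along the line
    projected = (point @ direction) * direction           # projection of the point onto the line
    a = np.linalg.norm(point)              # point to origin (does not depend on the line)
    b = np.linalg.norm(point - projected)  # point to line
    c = np.linalg.norm(projected)          # projected point to origin
    return a, b, c

point = np.array([4.0, 2.0])  # hypothetical centered data point
for angle in (0, 20, 40, 60):
    a, b, c = triangle_sides(point, angle)
    print(f"angle={angle:>2}  a^2={a**2:.3f}  b^2={b**2:.3f}  "
          f"c^2={c**2:.3f}  b^2+c^2={b**2 + c**2:.3f}")
```

As the line rotates, b² and c² trade off against each other while their sum stays equal to a².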
PCA measures the distance (d) from the projected point to the origin for all the data points (see below)
Next, we square all the distances and sum them up. Distances are squared so that negative values don’t cancel out positive values.
$d_1^2 + d_2^2 + d_3^2 + d_4^2 + d_5^2 + d_6^2 = \text{sum of squared distances} = SS(\text{distances})$
We keep rotating the line until we find the one with the largest SS(distances) between the projected points and the origin.
This line is called Principal Component 1 (PC1).
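The "rotate the line and keep the best one" idea can be mimicked with a brute-force search over angles. This is only an illustration of the intuition, not how PCA is computed in practice; the 0.01-degree grid is an arbitrary choice, and the result is cross-checked against `np.linalg.svd`.

```python
import numpy as np

# Mouse data (rows: mice, columns: Gene 1 and Gene 2), centered on the origin.
data = np.array([[10, 6], [11, 4], [8, 5], [3, 3], [2, 2.8], [1, 1]], dtype=float)
centered = data - data.mean(axis=0)

# Candidate lines through the origin, one per angle (arbitrary 0.01-degree grid).
angles = np.arange(0.0, 180.0, 0.01)
thetas = np.radians(angles)
directions = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)  # one unit vector per row

# d[i, j] = signed distance from the origin to mouse i projected onto line j.
d = centered @ directions.T
ss = (d ** 2).sum(axis=0)  # SS(distances) for every candidate line

best = ss.argmax()
best_direction = directions[best]
print("best angle (degrees):", angles[best])
print("slope of best line:", best_direction[1] / best_direction[0])
print("largest SS(distances):", ss[best])

# Cross-check: the first right singular vector from SVD points along the same line.
_, s, vt = np.linalg.svd(centered, full_matrices=False)
print("same line as SVD (up to sign):", abs(best_direction @ vt[0]) > 0.999999)
```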
The slope of PC1 shows how much each feature contributes to it. For example, a slope of 0.25 means that for every 4 units we go along the Gene 1 axis, we go up 1 unit along the Gene 2 axis. The data are therefore mostly spread out along the Gene 1 axis, and the 4:1 ratio of Gene 1 to Gene 2 tells us that Gene 1 is more important for describing how the data are spread out. In mathematical terms, PC1 is a linear combination of Gene 1 and Gene 2; more generally, every PC is a linear combination of the variables.
We can compute the length of this PC1 recipe (4 parts Gene 1, 1 part Gene 2) using the Pythagorean theorem: $a^2 = b^2 + c^2 \implies a = \sqrt{4^2 + 1^2} = \sqrt{17} \approx 4.12$.
When you do PCA with SVD, the recipe for PC1 is scaled so that its length is 1, so we divide all three sides of the triangle by 4.12. This 1-unit-long vector is called the singular vector or the eigenvector for PC1. After scaling, it consists of 0.97 parts Gene 1 and 0.242 parts Gene 2. These proportions of each gene in PC1’s unit vector are called loading scores.
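The scaling step is just a division by the vector’s length. A minimal sketch, using the example recipe of 4 parts Gene 1 and 1 part Gene 2 (the variable names are illustrative):

```python
import numpy as np

recipe = np.array([4.0, 1.0])    # 4 parts Gene 1, 1 part Gene 2 (slope 0.25)
length = np.linalg.norm(recipe)  # sqrt(4^2 + 1^2), about 4.12
unit_vector = recipe / length    # loading scores for PC1 in this example

print("length:", round(length, 2))
print("loading scores:", unit_vector.round(3))
print("length after scaling:", np.linalg.norm(unit_vector))
```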
PCA calls the SS(distances) for the best fit line the eigenvalue for PC1.
Watch this video to understand eigenvalues and eigenvectors.
SS(distances for PC1) = eigenvalue for PC1
$\sqrt{\text{eigenvalue for PC1}} = \text{singular value for PC1}$
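The relationship between SS(distances) and the singular value can be checked with NumPy’s SVD. This sketch follows the convention used here, where the eigenvalue for PC1 means SS(distances) before dividing by n−1; other texts may define the eigenvalue after that division.

```python
import numpy as np

# Mouse data (rows: mice, columns: Gene 1 and Gene 2), centered on the origin.
data = np.array([[10, 6], [11, 4], [8, 5], [3, 3], [2, 2.8], [1, 1]], dtype=float)
centered = data - data.mean(axis=0)

# Rows of vt are the unit-length singular vectors; vt[0] is the direction of PC1.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
pc1 = vt[0]

# Project every mouse onto PC1, then square and sum the distances to the origin.
d = centered @ pc1
eigenvalue_pc1 = (d ** 2).sum()  # SS(distances for PC1)

print("SS(distances for PC1):", eigenvalue_pc1)
print("square root of that:  ", np.sqrt(eigenvalue_pc1))
print("singular value for PC1:", singular_values[0])
```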
How to get PC2, PC3, …?
Because this is a 2D graph, PC2 is simply the line through the origin that is perpendicular to PC1; no further optimization is needed.
For our example, this means that PC2 is -1 parts Gene 1 and 4 parts Gene 2. Scaled to unit length, that becomes -0.242 parts Gene 1 and 0.97 parts Gene 2 (the loading scores for PC2).
The eigenvalue and eigenvector for PC2 are calculated the same way as for PC1.
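In two dimensions, a perpendicular direction can be obtained by rotating the PC1 unit vector by 90 degrees. A small sketch using the example recipe (4 parts Gene 1, 1 part Gene 2); the choice of counterclockwise rotation is arbitrary, since the opposite sign describes the same line.

```python
import numpy as np

# Unit vector for PC1 from the example recipe: 4 parts Gene 1, 1 part Gene 2.
pc1 = np.array([4.0, 1.0]) / np.hypot(4.0, 1.0)

# Rotating (x, y) by 90 degrees counterclockwise gives (-y, x).
pc2 = np.array([-pc1[1], pc1[0]])

print("PC2 loading scores:", pc2.round(3))
print("dot product with PC1:", pc1 @ pc2)      # 0 means perpendicular
print("length of PC2:", np.linalg.norm(pc2))   # still a unit vector
```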
Draw PCA plot
To draw the PCA plot, we simply rotate everything so that PC1 is horizontal.
Then, we use the projected points to find where samples go in the PCA plot.
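Rotating so that PC1 is horizontal amounts to expressing each centered point in PC1/PC2 coordinates. A minimal sketch, assuming the principal directions are taken from `np.linalg.svd`; the name `scores` is a common but not universal term for these coordinates.

```python
import numpy as np

# Mouse data (rows: mice, columns: Gene 1 and Gene 2), centered on the origin.
data = np.array([[10, 6], [11, 4], [8, 5], [3, 3], [2, 2.8], [1, 1]], dtype=float)
centered = data - data.mean(axis=0)

# Rows of vt are the unit vectors for PC1 and PC2.
u, s, vt = np.linalg.svd(centered, full_matrices=False)

# Projecting onto each PC gives the coordinates used in the PCA plot:
# column 0 is the position along PC1 (horizontal), column 1 along PC2 (vertical).
scores = centered @ vt.T

for mouse, (x, y) in enumerate(scores, start=1):
    print(f"Mouse {mouse}: PC1={x: .3f}, PC2={y: .3f}")
```

The sign of each axis is arbitrary, so a plot produced by another tool may appear mirrored while showing the same structure.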
Further notes
We get eigenvalues by projecting the data onto the principal components, measuring the distances to the origin, then squaring and adding them together.
We can convert them into variation around the origin (0,0) by dividing by the sample size minus 1 (i.e. n−1).
$\frac{SS(\text{distances for PC1})}{n-1} = \text{variation for PC1}$
$\frac{SS(\text{distances for PC2})}{n-1} = \text{variation for PC2}$
total variation = variation for PC1 + variation for PC2
For example, say the variation for PC1 is 15 and the variation for PC2 is 3, so the total variation is 18.
That means PC1 accounts for 15/18 = 0.83, or 83%, of the total variation.
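The same bookkeeping can be sketched for the mouse data. The code below divides SS(distances) by n−1 for each PC and reports each PC’s share of the total; it also repeats the 15-and-3 example from the text. Variable names are illustrative.

```python
import numpy as np

# Mouse data (rows: mice, columns: Gene 1 and Gene 2), centered on the origin.
data = np.array([[10, 6], [11, 4], [8, 5], [3, 3], [2, 2.8], [1, 1]], dtype=float)
centered = data - data.mean(axis=0)
n = data.shape[0]

# SS(distances) for each PC: project onto the PC, square, and sum.
_, _, vt = np.linalg.svd(centered, full_matrices=False)
ss = ((centered @ vt.T) ** 2).sum(axis=0)

variation = ss / (n - 1)                  # variation for PC1 and PC2
proportion = variation / variation.sum()  # share of the total variation

print("variation per PC:", variation)
print("proportion per PC:", proportion)

# The worked example from the text: variations of 15 and 3.
example = np.array([15.0, 3.0])
example_share = example / example.sum()
print("example shares:", example_share.round(2))
```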
The number of PCs equals the number of variables or the number of observations, whichever is lower.
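This limit shows up directly in the number of singular values that SVD returns. The sketch below uses made-up random matrices and a hypothetical helper, `n_components`; note that it only counts the components returned, and some of them can carry essentially zero variation.

```python
import numpy as np

rng = np.random.default_rng(0)

def n_components(x):
    """Number of PCs that SVD returns for a data matrix (rows: observations)."""
    centered = x - x.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    return len(singular_values)

wide = rng.normal(size=(3, 5))  # made-up data: 3 observations, 5 variables
tall = rng.normal(size=(6, 2))  # made-up data: 6 observations, 2 variables

print("3 observations x 5 variables ->", n_components(wide), "PCs")
print("6 observations x 2 variables ->", n_components(tall), "PCs")
```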