Principal Component Analysis (PCA)

Let’s start with a simple example to build intuition for PCA. Say we have the following dataset of gene measurements for six mice:

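|        | Mouse 1 | Mouse 2 | Mouse 3 | Mouse 4 | Mouse 5 | Mouse 6 |
|--------|---------|---------|---------|---------|---------|---------|
| Gene 1 | 10      | 11      | 8       | 3       | 2       | 1       |
| Gene 2 | 6       | 4       | 5       | 3       | 2.8     | 1       |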

If we measure only Gene 1, we can plot the data on a number line, where mice 1-3 have relatively higher values than mice 4-6. Adding the Gene 2 measurements lets us plot the data on a 2D graph.
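To make the example concrete, here is a minimal sketch of the dataset in NumPy. The mouse labels 1 to 6 and the variable names are assumptions made for illustration; the numbers come from the table above.

```python
import numpy as np

# Labels are assumed to run from Mouse 1 to Mouse 6, matching the text.
mice = [f"Mouse {i}" for i in range(1, 7)]
gene1 = np.array([10, 11, 8, 3, 2, 1], dtype=float)
gene2 = np.array([6, 4, 5, 3, 2.8, 1], dtype=float)

# One gene: every mouse is a single point on a number line.
# Mice above the Gene 1 average form the "high" group.
high = [m for m, g in zip(mice, gene1) if g > gene1.mean()]

# Two genes: every mouse becomes a point (Gene 1, Gene 2) in a 2D graph.
points = np.column_stack([gene1, gene2])

print("High Gene 1 group:", high)
print("2D points:\n", points)
```

Stacking the two genes as columns gives one row per mouse, which is the layout the later steps work with.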

PCA for 2-dimensional data

We start by plotting the data above, with Gene 1 on the x-axis and Gene 2 on the y-axis.

  1. Then, we’ll calculate the average measurement for Gene 1 and the average measurement for Gene 2. Together, these two averages give the center of the data.
  1. Now, we’ll shift the data so that the center is on top of the origin $(0,0)$ in the graph.
    • Note: Shifting the data did not change how the data points are positioned relative to each other.
  1. Now that the data is centered on the origin, we can try to fit a line to it. To do this:
    • We start by drawing a random line that goes through the origin.
    • Then, we rotate the line until it fits the data as well as it can, given that it has to go through the origin.
    • Ultimately, we want the line that fits best. But how does PCA decide whether a fit is good or not?
    • To quantify the fit, PCA projects the data onto the line. It can then either measure the distances from the data points to the line and look for the line that minimizes them, or measure the distances from the projected points to the origin and look for the line that maximizes them. These two criteria are equivalent.
    • To see why, look at one point’s projection onto the line. Let $a$ be the distance from the point to the origin, $b$ the distance from the point to the line, and $c$ the distance from the projected point to the origin, so that $a^2 = b^2 + c^2$. Rotating the line does not change $a$, so only $b$ and $c$ can change, and when one gets bigger the other must get smaller. Thus, PCA can either minimize the distance to the line ($b$) or maximize the distance from the projected point to the origin ($c$).
    • Intuitively, it makes sense to minimize $b$, but it’s actually easier to calculate $c$, so PCA finds the best fitting line by maximizing the sum of the squared distances from the projected points to the origin.
  1. PCA measures the distance $d$ from the projected point to the origin for each data point.
  1. Next, we square all the distances and sum them up. Distances are squared so that negative values don’t cancel out positive values.

$$d_1^2+d_2^2+d_3^2+d_4^2+d_5^2+d_6^2 = \text{sum of squared distances} = \text{SS(distances)}$$
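The steps so far (center the data, project onto a candidate line through the origin, sum the squared distances, rotate the line) can be sketched as below. This is only an illustration: the helper names `project` and `ss_distances` and the grid of 0.1 degree rotation steps are assumptions, and a brute-force search over angles is not how PCA is normally computed in practice.

```python
import numpy as np

# Rows are mice, columns are (Gene 1, Gene 2), taken from the table above.
X = np.array([[10, 6], [11, 4], [8, 5], [3, 3], [2, 2.8], [1, 1]], dtype=float)

# Find the center of the data and shift it onto the origin.
center = X.mean(axis=0)
Xc = X - center

def project(theta):
    """Project the centered points onto the line through the origin at angle theta.

    Returns (d, b): d is the signed distance of each projected point from the
    origin, b is the distance of each original point from the line.
    """
    direction = np.array([np.cos(theta), np.sin(theta)])  # unit vector along the line
    d = Xc @ direction
    b = np.linalg.norm(Xc - np.outer(d, direction), axis=1)
    return d, b

def ss_distances(theta):
    """Sum of squared distances from the projected points to the origin."""
    d, _ = project(theta)
    return np.sum(d ** 2)

# "Rotate" the line through a grid of angles (assumed 0.1 degree steps)
# and keep the angle with the largest SS(distances).
thetas = np.linspace(0.0, np.pi, 1801)
scores = np.array([ss_distances(t) for t in thetas])
best_theta = thetas[np.argmax(scores)]
best_ss = scores.max()

# The squared distance of the points from the origin is fixed,
# so maximizing SS(d) is the same as minimizing SS(b).
total_sq = np.sum(Xc ** 2)
d, b = project(best_theta)
print("best angle (degrees):", np.degrees(best_theta))
print("SS(d) + SS(b) equals total:", np.isclose(np.sum(d ** 2) + np.sum(b ** 2), total_sq))
```

Because the total squared distance to the origin does not depend on the angle, the same `best_theta` would come out of minimizing the squared distances to the line.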

  1. We keep rotating the line until we find the one with the largest $\text{SS(distances)}$ between the projected points and the origin.
  2. This line, the one with the largest $\text{SS(distances)}$, is called Principal Component 1 (PC1).
    • The slope of PC1 shows how much each feature contributes to it. For example, a slope of 0.25 means that for every 4 units we go along the Gene 1 axis, we go up 1 unit along the Gene 2 axis. That means the data are mostly spread out along the Gene 1 axis, so Gene 1 is more important than Gene 2 for describing how the data are spread out. In mathematical terms, PC1 is a linear combination of the variables, here 4 parts Gene 1 and 1 part Gene 2.
  1. We can compute the length of the PC1 vector (4 units along Gene 1, 1 unit along Gene 2) using the Pythagorean theorem: $a^2 = b^2 + c^2 \implies a = \sqrt{1^2+4^2} \approx 4.12$.
  1. When you do PCA with SVD, the recipe for PC1 is scaled so that its length is 1, so we divide all 3 sides of the triangle by 4.12. This unit-length vector is called the singular vector or the eigenvector for PC1. Its Gene 1 and Gene 2 components are approximately 0.97 and 0.24 respectively. These proportions of each gene in the PC1 unit vector are called loading scores.
    • PCA calls the $\text{SS(distances)}$ for the best fit line the eigenvalue for PC1.
    • Watch this video to understand eigenvalues and eigenvectors.
    • Watch video1, video2 and video3 for SVD.
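The scaling to unit length can be checked with a few lines of arithmetic. This sketch uses the illustrative slope of 0.25 from the text (4 units along Gene 1 for every 1 unit along Gene 2); the variable names are assumptions.

```python
import math

# Illustrative PC1 direction from the text: 4 parts Gene 1, 1 part Gene 2.
gene1_part, gene2_part = 4.0, 1.0

# Pythagorean theorem: length of the hypotenuse, sqrt(4^2 + 1^2).
length = math.hypot(gene1_part, gene2_part)

# Divide every side by the length to get a vector of length 1.
# Its components are the loading scores for Gene 1 and Gene 2.
loading_gene1 = gene1_part / length
loading_gene2 = gene2_part / length

print(f"length = {length:.2f}")
print(f"loading scores = ({loading_gene1:.2f}, {loading_gene2:.2f})")
```

The ratio between the two loading scores is still 4 to 1, so the scaling changes the length of the vector but not its direction.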

$$\text{SS(distances for PC1)} = \text{eigenvalue for PC1}$$
$$\sqrt{\text{eigenvalue for PC1}} = \text{singular value for PC1}$$
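These two relationships can be checked on the mouse data with `numpy.linalg.svd`. This is a sketch under the convention used in these notes, where the eigenvalue for PC1 is the SS(distances) itself, i.e. the largest eigenvalue of $X_c^T X_c$ for the centered data $X_c$. The variable names are assumptions.

```python
import numpy as np

# Rows are mice, columns are (Gene 1, Gene 2), taken from the table above.
X = np.array([[10, 6], [11, 4], [8, 5], [3, 3], [2, 2.8], [1, 1]], dtype=float)
Xc = X - X.mean(axis=0)  # center the data on the origin

# SVD of the centered data: the rows of Vt are unit-length singular vectors.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                  # loading scores for (Gene 1, Gene 2); the sign is arbitrary
singular_value_pc1 = s[0]

# SS(distances for PC1): project every point onto PC1 and sum the squares.
ss_pc1 = np.sum((Xc @ pc1) ** 2)

# Largest eigenvalue of Xc^T Xc (eigvalsh returns eigenvalues in ascending order).
eigenvalue_pc1 = np.linalg.eigvalsh(Xc.T @ Xc)[-1]

print("SS(distances for PC1):", ss_pc1)
print("eigenvalue for PC1:   ", eigenvalue_pc1)
print("sqrt(eigenvalue):     ", np.sqrt(eigenvalue_pc1))
print("singular value:       ", singular_value_pc1)
```

The first two printed numbers should agree, and so should the last two, up to floating-point error.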

How to get PC2, PC3, …?

Draw PCA plot

Further notes

$$\frac{\text{SS(distances for PC1)}}{n-1} = \text{variation for PC1}$$

$$\frac{\text{SS(distances for PC2)}}{n-1} = \text{variation for PC2}$$

$$\text{total variation} = \text{variation for PC1} + \text{variation for PC2}$$
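As a rough check of these formulas on the mouse data, the sketch below takes the squared singular values as the SS(distances) for PC1 and PC2 and divides them by $n-1$, with $n = 6$ mice. The variable names are assumptions.

```python
import numpy as np

# Rows are mice, columns are (Gene 1, Gene 2), taken from the table above.
X = np.array([[10, 6], [11, 4], [8, 5], [3, 3], [2, 2.8], [1, 1]], dtype=float)
Xc = X - X.mean(axis=0)
n = Xc.shape[0]  # number of mice

# Squared singular values are the SS(distances) for PC1 and PC2.
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
ss = s ** 2

variation = ss / (n - 1)            # variation for PC1 and PC2
total_variation = variation.sum()   # variation PC1 + variation PC2

# Sanity check: rotating the axes does not change the total, so it should
# match the sum of the sample variances of Gene 1 and Gene 2.
gene_variances = X.var(axis=0, ddof=1)

print("variation per PC:", variation)
print("total variation: ", total_variation)
print("sum of gene variances:", gene_variances.sum())
```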