Source: Probabilistic Machine Learning: An Introduction by Murphy

Introduction

The Importance of Dimensionality Reduction (DR)

  1. Fewer dimensions mean shorter training times and lower computational cost, which improves the overall efficiency of ML algorithms.
  2. DR can reduce the risk of overfitting.
  3. DR can mitigate multicollinearity by combining correlated features.
  4. DR can filter out noise in the data.
  5. DR can be used for data compression.

Principal Component Analysis (PCA)

What is PCA?

How to minimize the objective function?

The objective function is to minimize the reconstruction error,

$$\mathcal{L}(\mathbf{W}, \mathbf{Z}) = \frac{1}{N} \sum\limits_{n=1}^N ||\mathbf{x}_n - \mathbf{W} \mathbf{z}_n||^2$$

subject to the constraint that $\mathbf{W}$ is an orthogonal matrix.

Note: If $\mathbf{X}$ is an $N \times D$ matrix and the low-dimensional space has dimension $L$, then $\mathbf{W}$ has size $D \times L$ (so that $\mathbf{W}\mathbf{z}_n$ lives in the original $D$-dimensional space) and the matrix of latent factors, $\mathbf{Z}$, has size $N \times L$.

It can be shown that the optimal solution is obtained by setting $\hat{\mathbf{W}} = \mathbf{U}_L$, where $\mathbf{U}_L$ contains the $L$ eigenvectors with the largest eigenvalues of the empirical covariance matrix.

$$\hat{\mathbf{\Sigma}} = \frac{1}{N} \sum\limits_{n=1}^N (\mathbf{x}_n - \bar{\mathbf{x}})(\mathbf{x}_n - \bar{\mathbf{x}})^T = \frac{1}{N}\mathbf{X}_c^T \mathbf{X}_c$$

where $\mathbf{X}_c$ is a centered version of $\mathbf{X}$.

Note: Optimal weight vector maximizes the variance of the projected data. In other words, minimizing the reconstruction error is equivalent to maximizing the variance of the projected data. This is why it is often said that PCA finds the directions of maximal variance.

Note: This is equivalent to maximizing the likelihood of a latent linear Gaussian model known as probabilistic PCA.

Note: For details on deriving the optimal solution refer to the book page 653.
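The equivalence between minimizing reconstruction error and maximizing projected variance can be checked numerically. The following is a minimal NumPy sketch on synthetic data (the data, sizes, and variable names are assumptions made for illustration, not taken from the book):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, L = 200, 5, 2
# synthetic data (an assumption for illustration): correlated Gaussian features
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))

# center the data and form the empirical covariance matrix
x_bar = X.mean(axis=0)
Xc = X - x_bar
Sigma = Xc.T @ Xc / N

# eigh returns eigenvalues in ascending order, so reorder to put the largest first
evals, evecs = np.linalg.eigh(Sigma)
order = np.argsort(evals)[::-1]
evals, evecs = evals[order], evecs[:, order]

W = evecs[:, :L]          # D x L, orthonormal columns
Z = Xc @ W                # N x L latent representation
X_hat = Z @ W.T + x_bar   # reconstruction in the original space

recon_error = np.mean(np.sum((X - X_hat) ** 2, axis=1))
projected_variance = Z.var(axis=0).sum()
total_variance = np.trace(Sigma)

# reconstruction error + retained variance = total variance
print(recon_error, projected_variance, total_variance)
```

Because the total variance is fixed by the data, any choice of $\mathbf{W}$ that raises the retained variance lowers the reconstruction error by the same amount.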

Computational Issues

Covariance matrix vs correlation matrix

Dealing with high-dimensional data

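$$(\mathbf{X}\mathbf{X}^T)\mathbf{U} = \mathbf{U}\mathbf{\Lambda}$$

$$(\mathbf{X}^T\mathbf{X})(\mathbf{X}^T\mathbf{U}) = (\mathbf{X}^T\mathbf{U})\mathbf{\Lambda}$$

$$\mathbf{V} = \mathbf{X}^T\mathbf{U}\mathbf{\Lambda}^{-\frac{1}{2}}$$

Note: The factor $\mathbf{\Lambda}^{-\frac{1}{2}}$ normalizes the eigenvectors, since $||\mathbf{X}^T\mathbf{u}_j||^2 = \mathbf{u}_j^T\mathbf{X}\mathbf{X}^T\mathbf{u}_j = \lambda_j$.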
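When $D \gg N$ it is cheaper to eigendecompose the $N \times N$ matrix $\mathbf{X}\mathbf{X}^T$ and map its eigenvectors to those of $\mathbf{X}^T\mathbf{X}$, dividing each mapped vector by the square root of its eigenvalue so that it has unit length. A small sketch on random data (sizes and the zero-eigenvalue threshold are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 500                      # far more features than samples
X = rng.normal(size=(N, D))
X = X - X.mean(axis=0)              # centered data

# eigendecomposition of the small N x N matrix
lam, U = np.linalg.eigh(X @ X.T)

# keep only components with clearly non-zero eigenvalues
# (centering removes one degree of freedom, so one eigenvalue is ~0)
keep = lam > 1e-8 * lam.max()
lam, U = lam[keep], U[:, keep]

# map to eigenvectors of the D x D matrix and normalize each column
V = X.T @ U / np.sqrt(lam)

# columns of V are orthonormal eigenvectors of X^T X with eigenvalues lam
print(np.allclose(V.T @ V, np.eye(V.shape[1])))
print(np.allclose((X.T @ X) @ V, V * lam))
```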

Computing PCA using SVD

$$\mathbf{Z} = \mathbf{X}\mathbf{W} = \mathbf{U}_X\mathbf{S}_X\mathbf{V}_X^T\mathbf{V}_X = \mathbf{U}_X\mathbf{S}_X$$

$$\hat{\mathbf{X}} = \mathbf{Z}\mathbf{W}^T = \mathbf{U}_X\mathbf{S}_X\mathbf{V}_X^T$$
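The two identities above can be verified with a truncated SVD of the centered data. This is a sketch on synthetic data (all names and sizes are illustrative assumptions), taking $\mathbf{W}$ to be the top-$L$ right singular vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, L = 100, 6, 3
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))
Xc = X - X.mean(axis=0)

# thin SVD: Xc = U_X @ diag(s) @ Vt, singular values in descending order
U_X, s, Vt = np.linalg.svd(Xc, full_matrices=False)

W = Vt[:L].T                 # top-L right singular vectors, D x L
Z = Xc @ W                   # scores computed by projection
Z_svd = U_X[:, :L] * s[:L]   # the same scores read directly off the SVD

X_hat = Z @ W.T              # rank-L reconstruction of the centered data

# eigenvalues of the empirical covariance are the squared singular values over N
evals = s ** 2 / N
print(np.allclose(Z, Z_svd))
```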

Choosing the number of latent dimensions

Scree plots

$$F_L = \frac{\sum_{j=1}^L \lambda_j}{\sum_{j'=1}^{L^{\text{max}}} \lambda_{j'}}$$
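The fraction of variance explained is a cumulative sum of the sorted eigenvalues divided by their total. A short sketch, using made-up eigenvalues and a hypothetical helper name `fraction_of_variance`:

```python
import numpy as np

def fraction_of_variance(eigenvalues):
    """Return F_L for L = 1..L_max given the eigenvalues of the covariance."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    return np.cumsum(lam) / lam.sum()

# hypothetical eigenvalues, chosen only to illustrate the computation
lam = [4.0, 3.0, 2.0, 1.0]
F = fraction_of_variance(lam)
print(F)  # [0.4 0.7 0.9 1. ]

# smallest L that explains at least 80% of the variance
L = int(np.searchsorted(F, 0.80) + 1)
print(L)  # 3
```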

Importance of Feature Scaling for PCA

Feature scaling through standardization (or Z-score normalization) can be an important preprocessing step for many machine learning algorithms. Standardization involves rescaling the features such that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one.

While many algorithms (such as SVM, K-nearest neighbors, and logistic regression) require features to be normalized, intuitively we can think of Principal Component Analysis (PCA) as being a prime example of when normalization is important. In PCA we are interested in the components that maximize the variance. If one component (e.g. human height) varies less than another (e.g. weight) because of their respective scales (meters vs. kilos), PCA might determine that the direction of maximal variance more closely corresponds with the ‘weight’ axis, if those features are not scaled. As a change in height of one meter can be considered much more important than the change in weight of one kilogram, this is clearly incorrect.

For a visualized example of the impact of scaling check out this link from sklearn website.
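The height and weight argument can be reproduced with a few lines of scikit-learn. The data below are synthetic and independent by construction (means and spreads are assumptions chosen for illustration), so any dominance of one axis comes purely from the units:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
# synthetic, independent features (values are assumptions for illustration)
height_m = rng.normal(1.70, 0.10, size=n)    # metres, std about 0.1
weight_kg = rng.normal(70.0, 12.0, size=n)   # kilograms, std about 12
X = np.column_stack([height_m, weight_kg])

# without scaling, the first component is almost exactly the weight axis
pca_raw = PCA(n_components=2).fit(X)
print(pca_raw.components_[0], pca_raw.explained_variance_ratio_)

# after standardization both features contribute on an equal footing
X_std = StandardScaler().fit_transform(X)
pca_std = PCA(n_components=2).fit(X_std)
print(pca_std.components_[0], pca_std.explained_variance_ratio_)
```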

PCA implementation

Source

Exact PCA and the probabilistic interpretation

PCA is used to decompose a multivariate dataset in a set of successive orthogonal components that explain a maximum amount of the variance. In scikit-learn, PCA is implemented as a transformer object that learns $n$ components in its fit method, and can be used on new data to project it on these components.

PCA centers but does not scale the input data for each feature before applying the SVD. The optional parameter whiten=True makes it possible to project the data onto the singular space while scaling each component to unit variance. This is often useful if the models down-stream make strong assumptions on the isotropy of the signal: this is for example the case for Support Vector Machines with the RBF kernel and the K-Means clustering algorithm.

The PCA object also provides a probabilistic interpretation of the PCA that can give a likelihood of data based on the amount of variance it explains. As such it implements a score method that can be used in cross-validation:

[Figure: model selection with probabilistic PCA and factor analysis, from the scikit-learn example plot_pca_vs_fa_model_selection]
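A rough sketch of using the score method for model selection follows. The data are synthetic (three latent dimensions plus isotropic noise, an assumption made for illustration), and the range of candidate values is arbitrary:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# synthetic data: 3 latent dimensions embedded in 10, plus isotropic noise
Z_true = rng.normal(size=(500, 3))
X = Z_true @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(500, 10))

cv_scores = {}
for n in range(1, 7):
    pca = PCA(n_components=n, svd_solver="full")
    # with no scoring argument, cross_val_score calls pca.score
    # (average log-likelihood) on each held-out fold
    cv_scores[n] = cross_val_score(pca, X, cv=5).mean()

best_n = max(cv_scores, key=cv_scores.get)
print(cv_scores)
print(best_n)
```

The held-out log-likelihood should rise sharply until the number of components reaches the true latent dimension and then flatten out.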

Incremental PCA

The PCA object is very useful, but has certain limitations for large datasets. The biggest limitation is that PCA only supports batch processing, which means all of the data to be processed must fit in main memory. The IncrementalPCA object uses a different form of processing and allows for partial computations which almost exactly match the results of PCA while processing the data in a minibatch fashion. IncrementalPCA makes it possible to implement out-of-core Principal Component Analysis either by using its partial_fit method on chunks of data fetched sequentially from the local hard drive or a network database, or by calling its fit method on a memory-mapped file created with numpy.memmap.

IncrementalPCA only stores estimates of component and noise variances, in order to update explained_variance_ratio_ incrementally. This is why memory usage depends on the number of samples per batch, rather than the number of samples to be processed in the dataset.

As in PCA, IncrementalPCA centers but does not scale the input data for each feature before applying the SVD.
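A minimal sketch of the minibatch workflow, with an in-memory array standing in for data that would normally be read chunk by chunk (the data, chunk size, and number of components are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

rng = np.random.default_rng(0)
# synthetic data: 5 latent dimensions in 20 features, plus a little noise
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 20))
X += 0.05 * rng.normal(size=X.shape)

ipca = IncrementalPCA(n_components=5)
# feed the data in chunks of 100 rows, as if each chunk were read from disk
for start in range(0, X.shape[0], 100):
    ipca.partial_fit(X[start:start + 100])

Z = ipca.transform(X)

# compare with batch PCA fitted on the full array
pca = PCA(n_components=5).fit(X)
print(ipca.explained_variance_ratio_.sum())
print(pca.explained_variance_ratio_.sum())
```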

PCA using randomized SVD

It is often interesting to project data to a lower-dimensional space that preserves most of the variance, by dropping the singular vector of components associated with lower singular values.

For instance, if we work with 64x64 pixel gray-level pictures for face recognition, the dimensionality of the data is 4096 and it is slow to train an RBF support vector machine on such wide data. Furthermore we know that the intrinsic dimensionality of the data is much lower than 4096 since all pictures of human faces look somewhat alike. The samples lie on a manifold of much lower dimension (say around 200 for instance). The PCA algorithm can be used to linearly transform the data while both reducing the dimensionality and preserving most of the explained variance.

The class PCA used with the optional parameter svd_solver='randomized' is very useful in that case: since we are going to drop most of the singular vectors it is much more efficient to limit the computation to an approximated estimate of the singular vectors we will keep to actually perform the transform.

Note: the implementation of inverse_transform in PCA with svd_solver='randomized' is not the exact inverse transform of transform even when whiten=False (default).
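As a rough illustration, the randomized solver can be compared with the full solver on synthetic low-rank data (a stand-in for the face images; the sizes and noise level are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# synthetic low-rank data plus noise (a stand-in for the face images)
latent = rng.normal(size=(300, 10))
X = latent @ rng.normal(size=(10, 400)) + 0.01 * rng.normal(size=(300, 400))

pca_full = PCA(n_components=10, svd_solver="full").fit(X)
pca_rand = PCA(n_components=10, svd_solver="randomized", random_state=0).fit(X)

# both solvers should retain nearly the same share of the variance
print(pca_full.explained_variance_ratio_.sum())
print(pca_rand.explained_variance_ratio_.sum())

Z = pca_rand.transform(X)   # 300 x 10 reduced representation
```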

SparsePCA & MiniBatchSparsePCA

SparsePCA is a variant of PCA, with the goal of extracting the set of sparse components that best reconstruct the data.

Mini-batch sparse PCA (MiniBatchSparsePCA) is a variant of SparsePCA that is faster but less accurate. The increased speed is reached by iterating over small chunks of the set of features, for a given number of iterations.

Principal component analysis (PCA) has the disadvantage that the components extracted by this method have exclusively dense expressions, i.e. they have non-zero coefficients when expressed as linear combinations of the original variables. This can make interpretation difficult. In many cases, the real underlying components can be more naturally imagined as sparse vectors; for example in face recognition, components might naturally map to parts of faces.

Sparse principal components yield a more parsimonious, interpretable representation, clearly emphasizing which of the original features contribute to the differences between samples.

The following example illustrates 16 components extracted using sparse PCA from the Olivetti faces dataset. It can be seen how the regularization term induces many zeros. Furthermore, the natural structure of the data causes the non-zero coefficients to be vertically adjacent. The model does not enforce this mathematically: each component is a vector $h \in \mathbf{R}^{4096}$, and there is no notion of vertical adjacency except during the human-friendly visualization as 64x64 pixel images. The fact that the components shown below appear local is the effect of the inherent structure of the data, which makes such local patterns minimize reconstruction error. There exist sparsity-inducing norms that take into account adjacency and different kinds of structure; see [Jen09] for a review of such methods. For more details on how to use Sparse PCA, see the Examples section, below.

[Figure: 16 sparse PCA components extracted from the Olivetti faces dataset]

Exact Kernel PCA

KernelPCA is an extension of PCA which achieves non-linear dimensionality reduction through the use of kernels (see Pairwise metrics, Affinities and Kernels) [Scholkopf1997]. It has many applications including denoising, compression and structured prediction (kernel dependency estimation). KernelPCA supports both transform and inverse_transform.

[Figure: kernel PCA example, from the scikit-learn example plot_kernel_pca]

Note: KernelPCA.inverse_transform relies on a kernel ridge to learn the function mapping samples from the PCA basis into the original feature space [Bakir2004]. Thus, the reconstruction obtained with KernelPCA.inverse_transform is an approximation.
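A small sketch of transform and inverse_transform on toy data. The kernel, gamma, and alpha values are picked by hand for this example and are not recommendations:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(
    n_components=2,
    kernel="rbf",
    gamma=10,                    # kernel width, chosen by hand for this toy data
    fit_inverse_transform=True,  # must be set before inverse_transform can be used
    alpha=0.1,                   # ridge penalty used when learning the inverse map
)
Z = kpca.fit_transform(X)
X_back = kpca.inverse_transform(Z)

# the inverse map is learned, so the reconstruction error is not exactly zero
print(Z.shape, X_back.shape)
print(((X - X_back) ** 2).mean())
```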

Kernel PCA example

Choice of solver for Kernel PCA

While in PCA the number of components is bounded by the number of features, in KernelPCA the number of components is bounded by the number of samples. Many real-world datasets have a large number of samples. In these cases finding all the components with a full kPCA is a waste of computation time, as data is mostly described by the first few components (e.g. n_components<=100). In other words, the centered Gram matrix that is eigendecomposed in the Kernel PCA fitting process has an effective rank that is much smaller than its size. This is a situation where approximate eigensolvers can provide speedup with very low precision loss.

The optional parameter eigen_solver='randomized' can be used to significantly reduce the computation time when the number of requested n_components is small compared with the number of samples. It relies on randomized decomposition methods to find an approximate solution in a shorter time.

Truncated SVD & Latent Semantic Analysis

TruncatedSVD implements a variant of singular value decomposition (SVD) that only computes the $k$ largest singular values, where $k$ is a user-specified parameter.

When truncated SVD is applied to term-document matrices (as returned by CountVectorizer or TfidfVectorizer), this transformation is known as latent semantic analysis (LSA), because it transforms such matrices to a “semantic” space of low dimensionality. In particular, LSA is known to combat the effects of synonymy (several words sharing one meaning) and polysemy (one word having several meanings), which cause term-document matrices to be overly sparse and exhibit poor similarity under measures such as cosine similarity.

Mathematically, truncated SVD applied to training samples $\mathbf{X}$ produces a low-rank approximation of $\mathbf{X}$:

$$\mathbf{X} \approx \mathbf{X}_k = \mathbf{U}_k\mathbf{\Sigma}_k\mathbf{V}_k^T$$

After this operation, $\mathbf{U}_k\mathbf{\Sigma}_k$ is the transformed training set with $k$ features.

To also transform a test set $\mathbf{X}$, we multiply it with $\mathbf{V}_k$:

$$\mathbf{X}' = \mathbf{X}\mathbf{V}_k$$

Note: Most treatments of LSA in the natural language processing (NLP) and information retrieval (IR) literature swap the axes of the matrix $\mathbf{X}$ so that it has shape n_features $\times$ n_samples. We present LSA in a different way that matches the scikit-learn API better, but the singular values found are the same.

Note: TruncatedSVD is very similar to PCA, but differs in that the matrix X does not need to be centered. When the columnwise (per-feature) means of X are subtracted from the feature values, truncated SVD on the resulting matrix is equivalent to PCA. In practical terms, this means that the TruncatedSVD transformer accepts scipy.sparse matrices without the need to densify them, as densifying may fill up memory even for medium-sized document collections.

Note: While the TruncatedSVD transformer works with any feature matrix, using it on tf–idf matrices is recommended over raw frequency counts in an LSA/document processing setting. In particular, sublinear scaling and inverse document frequency should be turned on (sublinear_tf=True, use_idf=True) to bring the feature values closer to a Gaussian distribution, compensating for LSA’s erroneous assumptions about textual data.
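The LSA recipe described above might look like the following sketch. The corpus is a tiny made-up example, and two components is an arbitrary choice for illustration:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# tiny made-up corpus, for illustration only
train_docs = [
    "the cat sat on the mat",
    "a cat and a dog played",
    "dogs chase cats in the yard",
    "stocks fell as markets closed",
    "the market rallied and stocks rose",
    "investors sold shares in the market",
]
test_docs = ["my dog chased the cat", "shares and stocks dropped"]

lsa = make_pipeline(
    TfidfVectorizer(sublinear_tf=True, use_idf=True),
    TruncatedSVD(n_components=2, random_state=0),
)

Z_train = lsa.fit_transform(train_docs)  # U_k Sigma_k, one row per training document
Z_test = lsa.transform(test_docs)        # tf-idf test matrix multiplied by V_k
print(Z_train.shape, Z_test.shape)
```

The tf-idf matrix stays sparse throughout, since TruncatedSVD does not center its input.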

Factor Analysis

I will add the notes on this later.

Autoencoders

Bottleneck autoencoders

Denoising autoencoders (DAE)

Contractive autoencoders (CAE)

$$\mathbf{\Omega}(\mathbf{z},\mathbf{x}) = \lambda \left|\left|\frac{\partial f_e(\mathbf{x})}{\partial \mathbf{x}}\right|\right|_F^2 = \lambda \sum_k ||\nabla_{\mathbf{x}} h_k(\mathbf{x})||_2^2$$
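For a concrete feel, the penalty has a closed form when the encoder is a single sigmoid layer, $f_e(\mathbf{x}) = \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})$. That encoder form, the sizes, and the helper name `contractive_penalty` are assumptions made for this sketch:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def contractive_penalty(W, b, x, lam=1.0):
    """lam * ||d f_e(x) / d x||_F^2 for the encoder f_e(x) = sigmoid(W x + b)."""
    h = sigmoid(W @ x + b)
    # row k of the Jacobian is h_k (1 - h_k) * W[k], the gradient of h_k w.r.t. x
    grad_norms_sq = (h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1)
    return lam * grad_norms_sq.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # 3 hidden units, 5 inputs (hypothetical sizes)
b = rng.normal(size=3)
x = rng.normal(size=5)

penalty = contractive_penalty(W, b, x, lam=0.1)
print(penalty)
```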

Variational autoencoders (VAE)

More on this later.

Spectral Clustering

Take 1

Normalized cuts

$$\text{Ncut}(S_1, S_2, \dots, S_K) \triangleq \frac{1}{2} \sum\limits_{k=1}^{K} \frac{\text{cut}(S_k, \bar{S}_k)}{\text{vol}(S_k)}$$
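A small sketch of evaluating this objective, taking $\text{cut}(S_k, \bar{S}_k)$ as the total weight of edges leaving $S_k$ and $\text{vol}(S_k)$ as the sum of the degrees of its nodes. The graph and the helper name `ncut` are made up for illustration:

```python
import numpy as np

def ncut(W, labels):
    """Normalized cut of the partition given by integer cluster labels."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    degrees = W.sum(axis=1)
    total = 0.0
    for k in np.unique(labels):
        inside = labels == k
        cut = W[inside][:, ~inside].sum()   # weight of edges leaving S_k
        vol = degrees[inside].sum()         # total degree of the nodes in S_k
        total += cut / vol
    return 0.5 * total

# hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

good = ncut(W, [0, 0, 0, 1, 1, 1])   # cuts only the bridge edge
bad = ncut(W, [0, 1, 0, 1, 0, 1])    # mixes the two triangles
print(good, bad)
```

The partition that separates the two triangles has the lower normalized cut, which is the behaviour the objective is designed to reward.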


Take 2

Graph Cut

$$\text{cut}(C_1, \dots, C_k) = \sum\limits_{i=1}^{k} \sum\limits_{r \in C_i, s \notin C_i} W_{r,s}$$

$$\text{RatioCut}(C_1, \dots, C_k) = \sum\limits_{i=1}^{k} \frac{1}{|C_i|} \sum\limits_{r \in C_i, s \notin C_i} W_{r,s}$$

Graph Laplacian and Relaxed Graph Cuts

$$H_{i,j} = \frac{1}{\sqrt{|C_j|}}\mathbb{1}_{[i \in C_j]}$$

$$\text{RatioCut}(C_1, \dots, C_k) = \text{trace}(H^TLH)$$
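The trace identity can be checked numerically on a small graph, taking $L = D - W$ as the unnormalized graph Laplacian. The graph and partition below are made up for illustration:

```python
import numpy as np

# hypothetical graph: two triangles (0,1,2) and (3,4,5) joined by the edge 2-3
W = np.zeros((6, 6))
for r, s in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[r, s] = W[s, r] = 1.0
labels = np.array([0, 0, 0, 1, 1, 1])
k = 2

# direct evaluation of the RatioCut definition
ratio_cut = 0.0
for i in range(k):
    inside = labels == i
    ratio_cut += W[inside][:, ~inside].sum() / inside.sum()

# the same quantity via the graph Laplacian and the scaled indicator matrix H
L = np.diag(W.sum(axis=1)) - W
H = np.zeros((len(labels), k))
for j in range(k):
    members = labels == j
    H[members, j] = 1.0 / np.sqrt(members.sum())

trace_form = np.trace(H.T @ L @ H)
print(ratio_cut, trace_form)
```

The columns of $H$ are orthonormal, and relaxing $H$ to any matrix with orthonormal columns is what turns the combinatorial problem into an eigenvector problem.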

Spectral Clustering Algorithm


Take 3

Source

Eigenvalues and eigenvectors

import numpy as np

A = np.array([[0,1], [-2,3]])

# find eigenvectors and eigenvalues
vals, vecs = np.linalg.eig(A)

Adjacency Matrix

A = np.array([
  [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
  [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
  [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
  [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
  [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
  [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
  [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

Degree Matrix

D = np.diag(A.sum(axis=1))
print(D)

# [[4 0 0 0 0 0 0 0 0 0]
#  [0 2 0 0 0 0 0 0 0 0]
#  [0 0 2 0 0 0 0 0 0 0]
#  [0 0 0 2 0 0 0 0 0 0]
#  [0 0 0 0 2 0 0 0 0 0]
#  [0 0 0 0 0 4 0 0 0 0]
#  [0 0 0 0 0 0 2 0 0 0]
#  [0 0 0 0 0 0 0 2 0 0]
#  [0 0 0 0 0 0 0 0 2 0]
#  [0 0 0 0 0 0 0 0 0 2]]

Graph Laplacian

L = D-A
print(L)

# [[ 4 -1 -1  0  0  0  0  0 -1 -1]
#  [-1  2 -1  0  0  0  0  0  0  0]
#  [-1 -1  2  0  0  0  0  0  0  0]
#  [ 0  0  0  2 -1 -1  0  0  0  0]
#  [ 0  0  0 -1  2 -1  0  0  0  0]
#  [ 0  0  0 -1 -1  4 -1 -1  0  0]
#  [ 0  0  0  0  0 -1  2 -1  0  0]
#  [ 0  0  0  0  0 -1 -1  2  0  0]
#  [-1  0  0  0  0  0  0  0  2 -1]
#  [-1  0  0  0  0  0  0  0 -1  2]]

Eigenvalues of Graph Laplacian
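One property worth checking directly: the number of zero eigenvalues of the graph Laplacian equals the number of connected components of the graph. The sketch below uses the 10-node adjacency matrix from above, with a numerical tolerance of 1e-8 that is an arbitrary choice, and compares against SciPy's connected_components:

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

# the 10-node adjacency matrix defined above (two groups of five nodes)
A = np.array([
  [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
  [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
  [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
  [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
  [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
  [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
  [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
  [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
  [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
  [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

L = np.diag(A.sum(axis=1)) - A

# L is symmetric, so eigvalsh returns real eigenvalues in ascending order
vals = np.linalg.eigvalsh(L)
n_zero = int(np.sum(np.abs(vals) < 1e-8))

n_comp, comp_labels = connected_components(A, directed=False)
print(n_zero, n_comp)  # 2 2
```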


from sklearn.cluster import KMeans

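# our adjacency matrix: the graph above plus an edge between nodes 0 and 5,
# which joins the two groups (this matches the matrix printed below)
A = A.astype(float)
A[0, 5] = A[5, 0] = 1
print("Adjacency Matrix:")
print(A)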

# Adjacency Matrix:
# [[0. 1. 1. 0. 0. 1. 0. 0. 1. 1.]
#  [1. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
#  [1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
#  [0. 0. 0. 0. 1. 1. 0. 0. 0. 0.]
#  [0. 0. 0. 1. 0. 1. 0. 0. 0. 0.]
#  [1. 0. 0. 1. 1. 0. 1. 1. 0. 0.]
#  [0. 0. 0. 0. 0. 1. 0. 1. 0. 0.]
#  [0. 0. 0. 0. 0. 1. 1. 0. 0. 0.]
#  [1. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
#  [1. 0. 0. 0. 0. 0. 0. 0. 1. 0.]]

# diagonal matrix
D = np.diag(A.sum(axis=1))

# graph laplacian
L = D-A

# eigenvalues and eigenvectors
# (L is symmetric, so eigh guarantees real eigenvalues and eigenvectors)
vals, vecs = np.linalg.eigh(L)

# sort these based on the eigenvalues
order = np.argsort(vals)
vecs = vecs[:, order]
vals = vals[order]

# kmeans on first three vectors with nonzero eigenvalues
kmeans = KMeans(n_clusters=4)
kmeans.fit(vecs[:,1:4])
colors = kmeans.labels_

print("Clusters:", colors)

# Clusters: [2 1 1 0 0 0 3 3 2 2]

Summary: We first took our graph and built an adjacency matrix. We then created the Graph Laplacian by subtracting the adjacency matrix from the degree matrix. The eigenvalues of the Laplacian indicated that there were four clusters. The vectors associated with those eigenvalues contain information on how to segment the nodes. Finally, we performed K-Means on those vectors in order to get the labels for the nodes. Next, we’ll see how to do this for arbitrary data.

Spectral Clustering Arbitrary Data

Nearest Neighbor Graph

from sklearn.datasets import make_circles
from sklearn.neighbors import kneighbors_graph
import numpy as np

# create the data
X, labels = make_circles(n_samples=500, noise=0.1, factor=.2)

# use the nearest neighbor graph as our adjacency matrix
A = kneighbors_graph(X, n_neighbors=5).toarray()
print(A)

# [[0. 0. 0. ... 0. 0. 0.]
#  [0. 0. 0. ... 0. 0. 0.]
#  [0. 0. 0. ... 0. 0. 0.]
#  ...
#  [0. 0. 0. ... 0. 1. 0.]
#  [0. 0. 0. ... 0. 0. 0.]
#  [0. 0. 0. ... 0. 0. 0.]]
 
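# create the graph laplacian
# (kneighbors_graph is directed, so symmetrize A to get an undirected graph)
A = np.maximum(A, A.T)
D = np.diag(A.sum(axis=1))
L = D-A

# find the eigenvalues and eigenvectors
# (L is now symmetric, so eigh returns real eigenvalues in ascending order)
vals, vecs = np.linalg.eigh(L)

# use the Fiedler vector (eigenvector of the second-smallest eigenvalue)
# to find the best cut to separate the data
clusters = vecs[:,1] > 0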

Other Approaches

Conclusion

Take 4

Another nice article comparing Kernel K-means with spectral clustering.

Other resources