Spring 2018
Cancer research: Look for subgroups within the patients or within the genes in order to better understand the disease
Online shopping site: Identify groups of shoppers as well as groups of items within each of those shopper groups.
Search engine: Search only a subset of the documents in order to find the best one for retrieval.
There is usually no obvious ground-truth to compare to
Covered in this module:
We want to visualize \(n\) observations with \(p\) features
Two-dimensional scatterplots show only pairs of features: with \(p\) features there are \(p(p-1)/2\) of them, too many to inspect when \(p\) is large
We want to create an \(n \times M\) matrix \(Z\), with \(M < p\).
The column \(Z_m\) of \(Z\) is the \(m\)-th principal component.
\[Z_m = \sum_{j=1}^{p} \phi_{jm} X_j \quad \text{subject to} \quad \sum_{j=1}^p \phi_{jm}^2 = 1\]
And so on …
An \(M\)-dimensional representation that captures most of the variability contained in the data
An \(M\)-dimensional representation that is closest to the data points (in average squared Euclidean distance)
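A minimal sketch of this construction in Python with NumPy (the toy matrix `X` and the sizes `n`, `p`, `M` are illustrative, not from the lab):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 100, 5, 2

# Toy data; in practice X holds the observed features, centered column-wise
X = rng.normal(size=(n, p))
X = X - X.mean(axis=0)

# Loadings phi_m: eigenvectors of the covariance matrix (unit norm by construction)
Sigma = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma)
Phi = eigvecs[:, np.argsort(eigvals)[::-1]][:, :M]   # p x M, top-M directions

# Score matrix Z: each column Z_m = sum_j phi_jm X_j
Z = X @ Phi
print(Z.shape)   # (100, 2)
```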
Let \(\boldsymbol X\) be a matrix with dimension \(n \times p\).
Each column represents one predictor (a vector of \(n\) observations).
Let \(\boldsymbol \Sigma\) be the covariance matrix associated with \(\boldsymbol X\).
We want \(\boldsymbol Z_1 = \boldsymbol X \boldsymbol \phi_1\), subject to \(||\boldsymbol \phi_1||_2 = 1\)
We want \(\boldsymbol Z_1\) to have the highest possible variance, \(V(\boldsymbol Z_1) = \boldsymbol \phi_1^T \boldsymbol \Sigma \boldsymbol \phi_1\)
\(\boldsymbol \phi_1\) equals the eigenvector corresponding to the largest eigenvalue of \(\boldsymbol \Sigma\): maximizing \(\boldsymbol \phi^T \boldsymbol \Sigma \boldsymbol \phi\) subject to \(\boldsymbol \phi^T \boldsymbol \phi = 1\) via a Lagrange multiplier gives \(\boldsymbol \Sigma \boldsymbol \phi = \lambda \boldsymbol \phi\), so the attained variance is the eigenvalue \(\lambda\)
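A quick numerical check of this claim, as a sketch with NumPy (the matrix `Sigma` here is an arbitrary synthetic covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
Sigma = A @ A.T                       # an arbitrary valid covariance matrix

eigvals, eigvecs = np.linalg.eigh(Sigma)
phi1 = eigvecs[:, -1]                 # eigenvector of the largest eigenvalue

# The variance along phi1 equals the largest eigenvalue ...
print(np.isclose(phi1 @ Sigma @ phi1, eigvals[-1]))      # True

# ... and no other unit vector achieves more
v = rng.normal(size=6)
v /= np.linalg.norm(v)
print(v @ Sigma @ v <= phi1 @ Sigma @ phi1)              # True
```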
The fraction of the original variance retained by the first \(M\) principal components:
\[ R^2 = \frac{\sum _{i=1}^{M} \lambda _i}{\sum _{j=1}^{p}\lambda _j} \]
Not all methods require scaling; linear regression, for example, does not
PCA usually does: otherwise variables measured on large scales dominate the leading components
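A small illustration of the effect, as a sketch with NumPy (the feature names and scales are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
# Hypothetical features on very different scales (names are made up)
height_cm = rng.normal(170.0, 10.0, n)                     # variance ~100
weight_t = 0.0004 * height_cm + rng.normal(0, 0.005, n)    # tonnes, variance ~4e-5

X = np.column_stack([height_cm, weight_t])
X = X - X.mean(axis=0)

# Unscaled: the first loading vector is dominated by the high-variance column
_, V = np.linalg.eigh(np.cov(X, rowvar=False))
print(V[:, -1])          # close to [1, 0] up to sign: essentially height alone

# Standardized: both (correlated) features contribute to the first PC
Xs = X / X.std(axis=0)
_, Vs = np.linalg.eigh(np.cov(Xs, rowvar=False))
print(Vs[:, -1])         # close to [0.71, 0.71] up to sign
```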
Each Principal Component loading vector is unique, up to a sign flip.
Flipping the sign has no effect as the direction of the PC does not change.
The approximation below will not change either, because the sign of the score vector compensates for the flip of the loading vector:
\[x_{ij} \approx \sum_{m=1}^{M} z_{im} \phi_{jm}\]
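A sketch verifying this invariance numerically on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
X = X - X.mean(axis=0)

_, V = np.linalg.eigh(np.cov(X, rowvar=False))
Phi = V[:, ::-1][:, :2]               # top-2 loading vectors, p x M
Z = X @ Phi                           # corresponding scores

# Flip the sign of the first loading vector; its scores flip with it
Phi_flipped = Phi.copy()
Phi_flipped[:, 0] *= -1
Z_flipped = X @ Phi_flipped

# The rank-M approximation x_ij ~ sum_m z_im phi_jm is unchanged
print(np.allclose(Z @ Phi.T, Z_flipped @ Phi_flipped.T))   # True
```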
Let's assume the variables are centered to have mean zero.
Total variance present in a dataset:
\[\sum_{j=1}^p Var(X_j) = \sum_{j=1}^p \frac{1}{n} \sum _{i=1}^ {n}x_{ij}^2\]
The variance explained by the \(m\)-th principal component:
\[\frac{1}{n} \sum_{i=1}^{n}z_{im}^2 = \frac{1}{n} \sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}x_{ij}\right)^2\]
The proportion of variance explained (PVE) by the \(m\)-th principal component is then:
\[\frac{\sum_{i=1}^{n}\left(\sum_{j=1}^{p}\phi_{jm}x_{ij}\right)^2}{\sum_{j=1}^p \sum _{i=1}^ {n}x_{ij}^2}\]
In terms of the eigenvalues of \(\boldsymbol \Sigma\), the cumulative PVE of the first \(M\) components is:
\[ \frac{\sum _{m=1}^{M} \lambda _m}{\sum _{j=1}^{p}\lambda _j} \]
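A sketch confirming on synthetic data that the two expressions for the PVE agree:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated toy data
X = X - X.mean(axis=0)

# ddof=0 matches the 1/n convention used in the formulas above
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False, ddof=0))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

Z = X @ eigvecs                       # all p score vectors

# PVE per component, from the scores and from the eigenvalues
pve_scores = (Z ** 2).sum(axis=0) / (X ** 2).sum()
pve_eigvals = eigvals / eigvals.sum()
print(np.allclose(pve_scores, pve_eigvals))   # True
```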
There is no objective answer
Ad hoc, by looking at the PVE plot (scree plot) for an elbow
Lab 1: Principal component analysis applied to the USArrests dataset.
Extra: PCA on the New York Times stories
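A possible starting point for the lab in Python, as a sketch (it fetches R's USArrests data via statsmodels' `get_rdataset`, which downloads the file and therefore needs network access):

```python
import numpy as np
import statsmodels.api as sm

# R's built-in USArrests data (50 states, 4 variables)
usarrests = sm.datasets.get_rdataset("USArrests").data

X = usarrests.to_numpy(dtype=float)
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize: the variables are
                                           # on very different scales

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# Loading signs are arbitrary, as noted above
print("PC1 and PC2 loadings:\n", eigvecs[:, :2])
print("PVE:", eigvals / eigvals.sum())
```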