Spring 2018
Dimension reduction methods construct linear combinations of the original predictors,
\[ Z_m = \sum_{j=1}^{p} \phi_{jm} X_j\]
for \(m=1, ..., M\), \(j=1, ..., p\) and \(M < p\), and then fit a linear regression on the transformed predictors:
\[ y_i = \theta _0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon _i, \quad i=1,...,n\]
The fitted model can be rewritten in terms of the original predictors, with coefficients
\[ \beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}\]
since \(\sum_{m=1}^{M} \theta_m z_{im} = \sum_{m=1}^{M} \theta_m \sum_{j=1}^{p} \phi_{jm} x_{ij} = \sum_{j=1}^{p} \beta_j x_{ij}\).
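Because the \(z_{im}\) are linear in the original predictors, the two parameterizations give identical fitted values. A minimal numpy sketch of this identity (all sizes and values below are arbitrary illustrations, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 50, 6, 2                # illustrative sizes

X = rng.normal(size=(n, p))       # original predictors
Phi = rng.normal(size=(p, M))     # loadings phi_{jm}
theta = rng.normal(size=M)        # regression coefficients on Z

Z = X @ Phi                       # Z_m = sum_j phi_{jm} X_j
beta = Phi @ theta                # beta_j = sum_m theta_m phi_{jm}

# The two parameterizations give identical fitted values:
assert np.allclose(Z @ theta, X @ beta)
```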
So dimension reduction serves to constrain the coefficients of a standard linear regression. This constraint increases the bias, but is useful in situations where the variance is high.
PCA is discussed in greater detail in Chapter 10, on unsupervised learning.
PCA is an unsupervised technique for reducing the dimension of an \(n \times p\) data matrix \(X\).
We want to create an \(n \times M\) matrix \(Z\), with \(M < p\).
The column \(Z_m\) of \(Z\) is the \(m\)-th principal component.
\[Z_m = \sum_{j=1}^{p} \phi_{jm} X_j \quad \text{subject to} \quad \sum_{j=1}^p \phi_{jm}^2 = 1\]
And so on: each subsequent component has the highest variance subject to being uncorrelated with the preceding components.
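As a quick sanity check of the unit-norm constraint, a sketch with scikit-learn's PCA on random stand-in data (not the Credit dataset); the rows of `components_` are the loading vectors \(\phi_m\):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))     # stand-in data matrix (n=100, p=5)

M = 2
pca = PCA(n_components=M)
Z = pca.fit_transform(X)          # n x M matrix of principal components

# Each loading vector phi_m satisfies sum_j phi_{jm}^2 = 1
print(np.sum(pca.components_**2, axis=1))   # -> [1. 1.]
```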
Let \(\boldsymbol X\) be a matrix with dimension \(n \times p\).
Each column represents a vector of predictors.
Assume \(\boldsymbol \Sigma\) is the covariance matrix associated with \(\boldsymbol X\).
We want \(\boldsymbol Z_1 = \boldsymbol X \boldsymbol \phi_1\), subject to \(||\boldsymbol \phi_1||_2 = 1\).
We want \(\boldsymbol Z_1\) to have the highest possible variance, \(V(\boldsymbol Z_1) = \boldsymbol \phi_1^T \boldsymbol \Sigma \boldsymbol \phi_1\).
\(\boldsymbol \phi_1\) equals the eigenvector corresponding to the largest eigenvalue of \(\boldsymbol \Sigma\).
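A minimal numpy sketch of this result on centered stand-in data: the top eigenvector of the sample covariance matrix gives \(\boldsymbol \phi_1\), and the variance of \(\boldsymbol Z_1\) equals the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 4))
X = X - X.mean(axis=0)            # center the columns

Sigma = np.cov(X, rowvar=False)   # p x p sample covariance matrix

# eigh returns eigenvalues in ascending order for symmetric matrices
eigvals, eigvecs = np.linalg.eigh(Sigma)
phi1 = eigvecs[:, -1]             # eigenvector for the largest eigenvalue

Z1 = X @ phi1                     # first principal component scores
# Its variance equals phi1' Sigma phi1, i.e. the largest eigenvalue
print(np.var(Z1, ddof=1), eigvals[-1])  # the two values agree
```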
The fraction of the original variance retained by the first \(M\) principal components is
\[ R^2 = \frac{\sum _{i=1}^{M} \lambda _i}{\sum _{j=1}^{p}\lambda _j} \]
where \(\lambda_1 \ge \dots \ge \lambda_p\) are the eigenvalues of \(\boldsymbol \Sigma\).
Singular Value Decomposition (SVD) is more numerically stable than eigendecomposition and is usually used in practice.
How many principal components to retain will depend on the specific application.
Plotting \((1-R^2)\) versus the number of components helps in selecting how many components to keep.
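A sketch combining the two previous points, assuming numpy and matplotlib: PCA via SVD on centered stand-in data, with the \((1-R^2)\) curve used to choose \(M\). The squared singular values, divided by \(n-1\), are the eigenvalues \(\lambda_i\).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
X = X - X.mean(axis=0)                     # PCA requires centered columns

# SVD of X: squared singular values / (n-1) are the eigenvalues of Sigma
U, s, Vt = np.linalg.svd(X, full_matrices=False)
lam = s**2 / (X.shape[0] - 1)

R2 = np.cumsum(lam) / np.sum(lam)          # variance retained by first M PCs

plt.plot(np.arange(1, len(R2) + 1), 1 - R2, marker="o")
plt.xlabel("Number of components M")
plt.ylabel("$1 - R^2$ (variance not explained)")
plt.show()
```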
How many principal components should we use for the Credit dataset? Justify your answer.
Principal component analysis (PCA) is a dimensionality reduction technique.
Key assumption: a small number of principal components suffice to explain most of the variability in the data, as well as the relationship with the response.
Apply PCR on the Credit dataset and compare the results with the methods covered in Lecture 1.
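A minimal PCR sketch with scikit-learn, chaining standardization, PCA, and least squares in a pipeline; random stand-in data are used so the snippet runs on its own (for the exercise you would substitute the Credit predictors and response for `X` and `y`):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(400, 10))            # stand-in for the Credit predictors
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=400)

# PCR = standardize -> project onto M principal components -> least squares
for M in range(1, 11):
    pcr = make_pipeline(StandardScaler(), PCA(n_components=M), LinearRegression())
    score = cross_val_score(pcr, X, y, cv=5, scoring="neg_mean_squared_error")
    print(M, -score.mean())               # pick M with the smallest CV error
```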
Apply PLS on the Credit dataset and compare the results with the methods covered in Lecture 1 and PCR.
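Analogously, a PLS sketch using scikit-learn's `PLSRegression`, with the same stand-in-data caveat as above; unlike PCR, PLS uses the response when constructing the components:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 10))            # stand-in predictors
y = X[:, :3] @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=400)

# PLS chooses directions using both X and the response y
for M in range(1, 11):
    pls = PLSRegression(n_components=M, scale=True)
    score = cross_val_score(pls, X, y, cv=5, scoring="neg_mean_squared_error")
    print(M, -score.mean())
```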
PLS, PCR and ridge regression tend to behave similarly.
Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps.
Lasso falls somewhere between ridge regression and best subset regression, and enjoys some of the properties of each.
High-dimensional problems: \(n < p\).
Such problems are increasingly common nowadays.
In general, adding signal features that are truly associated with the response helps (smaller test-set errors), whereas adding noise features hurts test performance.
New technologies that allow for the collection of measurements for thousands or millions of features are a double-edged sword
In the high-dimensional setting, the multicollinearity problem is extreme: any variable in the model can be written as a linear combination of all of the others.
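A small sketch of how extreme this gets (stand-in noise data): with \(p > n\), least squares interpolates the training responses exactly, so the training error is zero even though the features carry no signal.

```python
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(6)
n, p = 20, 50                              # high-dimensional: p > n
X = rng.normal(size=(n, p))                # pure noise features
y = rng.normal(size=n)                     # response unrelated to X

beta, *_ = lstsq(X, y, rcond=None)         # minimum-norm least squares fit
print(np.max(np.abs(X @ beta - y)))        # ~0: a perfect in-sample fit
```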