Spring 2018

Previous lecture

Subset selection and shrinkage methods

  • Subset selection and shrinkage methods control variance in two ways:
    • Using a subset of the original predictors.
    • Shrinking their coefficients towards zero.
  • Those methods use the original (possibly standardized) predictors \(X_1\), …, \(X_p\).

Dimension reduction methods

  • Transform the original predictors

\[ Z_m = \sum_{j=1}^{p} \phi_{jm} X_j, \quad m = 1, \ldots, M, \quad M < p \]

  • Fit least squares using the transformed predictors

\[ y_i = \theta _0 + \sum_{m=1}^{M} \theta_m z_{im} + \epsilon _i, \quad i=1,...,n\]

Constrained interpretation

  • It can be shown that

\[ \beta_j = \sum_{m=1}^{M} \theta_m \phi_{jm}\]

  • So dimension reduction serves to constrain the coefficients of a standard linear regression

  • This constraint increases the bias but is useful in situations where the variance is high.
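
A small numpy sketch of this equivalence (the toy data and the arbitrary loadings \(\phi_{jm}\) are our own): fitting least squares on the transformed predictors gives \(\theta\), and the implied coefficients \(\beta_j = \sum_m \theta_m \phi_{jm}\) reproduce exactly the same fitted values in terms of the original predictors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 50, 6, 2
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

Phi = rng.normal(size=(p, M))        # arbitrary loadings phi_{jm} (illustrative only)
Z = X @ Phi                          # Z_m = sum_j phi_{jm} X_j

# Least squares of y on the transformed predictors (with an intercept)
design = np.column_stack([np.ones(n), Z])
theta, *_ = np.linalg.lstsq(design, y, rcond=None)
theta0, theta_m = theta[0], theta[1:]

beta = Phi @ theta_m                 # implied beta_j = sum_m theta_m phi_{jm}
print(np.allclose(theta0 + Z @ theta_m, theta0 + X @ beta))   # True: identical fitted values
```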

Outline

  • We will cover two approaches to dimensionality reduction:
    • Principal Components
    • Partial Least Squares

Principal Component Analysis (PCA)

  • Discussed in greater detail in Chapter 10, on unsupervised learning.

  • The focus in this lecture is on how it can be applied to regression.
    • That is, in a supervised setting.

  • PCA is an (unsupervised) technique for reducing the dimension of an \(n \times p\) data matrix \(X\).

Principal Component Analysis (PCA)

  • We want to create an \(n \times M\) matrix \(Z\), with \(M < p\).

  • The column \(Z_m\) of \(Z\) is the \(m\)-th principal component.

\[Z_m = \sum_{j=1}^{p} \phi_{jm} X_j \quad \text{subject to} \quad \sum_{j=1}^p \phi_{jm}^2 = 1\]

  • We want \(Z_1\) to have the highest possible variance.
    • That is, it points in the direction along which the observations vary the most.
    • Without the constraint, we could get arbitrarily high variance simply by increasing the \(\phi_{j1}\)

Principal Component Analysis (PCA)

  • \(Z_2\) should be uncorrelated with \(Z_1\), and have the highest variance subject to this constraint.
    • The direction of \(Z_2\) must be perpendicular (orthogonal) to the direction of \(Z_1\)

  • And so on …

  • We can construct up to \(p\) PCs that way.
    • In which case we have captured all the variability contained in the data
    • We have created a set of orthogonal predictors
    • But have not accomplished dimensionality reduction
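
A minimal illustrative sketch (scikit-learn on simulated data of our own) showing that the constructed components have decreasing variance and are mutually uncorrelated:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, p = 200, 4
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # correlated toy predictors

pca = PCA(n_components=p)            # up to p components can be constructed
Z = pca.fit_transform(X)             # columns of Z are the principal components

print(Z.var(axis=0, ddof=1))                          # variances appear in decreasing order
print(np.round(np.corrcoef(Z, rowvar=False), 6))      # off-diagonal ~ 0: the PCs are uncorrelated
```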

PCA Example - Ad spending

  (Figures illustrating PCA applied to the ad-spending data.)

PCA - General setup

  • Let \(\boldsymbol X\) be a matrix with dimension \(n \times p\).

  • Each column contains the \(n\) observations of one predictor.

  • Let \(\boldsymbol \Sigma\) denote the covariance matrix associated with \(\boldsymbol X\).

  • Since \(\Sigma\) is a non-negative definite matrix, it has an eigen-decomposition \[ \boldsymbol {\Sigma} = \boldsymbol C \boldsymbol \Lambda \boldsymbol C^{-1} \]
    • \(\boldsymbol \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)\) is a diagonal matrix of (non-negative) eigenvalues in decreasing order,
    • \(\boldsymbol C\) is a matrix whose columns are the corresponding eigenvectors of \(\boldsymbol {\Sigma}\).

PCA - General setup (II)

  • We want \(\boldsymbol Z_1 = \boldsymbol X \boldsymbol \phi_1\), subject to \(||\boldsymbol \phi_1||_2 = 1\)

  • We want \(\boldsymbol Z_1\) to have the highest possible variance, \(V(\boldsymbol Z_1) = \boldsymbol \phi_1^T \boldsymbol \Sigma \boldsymbol \phi_1\)

  • \(\boldsymbol \phi_1\) equals the eigenvector (column of \(\boldsymbol C\)) corresponding to the largest eigenvalue of \(\boldsymbol \Sigma\)

  • The fraction of the original variance retained by the first \(M\) principal components is

\[ R^2 = \frac{\sum _{i=1}^{M} \lambda _i}{\sum _{j=1}^{p}\lambda _j} \]
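
A minimal numpy sketch of this construction on simulated data (the data and variable names are ours): the first loading vector is the top eigenvector of the sample covariance matrix, the variance of \(Z_1\) equals the largest eigenvalue, and the eigenvalues give the retained-variance fraction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 200, 5, 2
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))   # toy correlated data
Xc = X - X.mean(axis=0)                                  # center the columns

Sigma = np.cov(Xc, rowvar=False)                         # sample covariance matrix
lam, C = np.linalg.eigh(Sigma)                           # eigenvalues in ascending order
order = np.argsort(lam)[::-1]                            # re-sort to decreasing order
lam, C = lam[order], C[:, order]

phi1 = C[:, 0]                     # loading vector of the first principal component
Z1 = Xc @ phi1                     # scores of the first principal component
print(np.allclose(Z1.var(ddof=1), lam[0]))               # variance of Z1 equals the top eigenvalue

# Fraction of the variance retained by the first M components
R2 = lam[:M].sum() / lam.sum()
print(R2)
```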

PCA - general advice

  • PCA is not scale invariant,
    • standardize all the \(p\) variables before applying PCA.

  • Singular Value Decomposition (SVD) is more numerically stable than eigendecomposition and is usually used in practice.

  • How many principal components to retain will depend on the specific application.

  • Plotting \(1-R^2\) versus the number of components helps in selecting how many components to retain.
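
A sketch of this advice using scikit-learn (the simulated data and the 90% cut-off are our own illustrative choices, not a universal rule): standardize first, then inspect the cumulative proportion of variance explained to choose \(M\).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8)) * rng.uniform(1, 100, size=8)   # columns on very different scales

# PCA is not scale invariant, so standardize the variables first.
X_std = StandardScaler().fit_transform(X)

pca = PCA().fit(X_std)                       # scikit-learn computes the PCs via the SVD
cum_r2 = np.cumsum(pca.explained_variance_ratio_)
print(np.round(1 - cum_r2, 3))               # the quantity one would plot against the number of components

M = int(np.searchsorted(cum_r2, 0.90)) + 1   # e.g. keep enough components to retain 90% of the variance
print("chosen M:", M)
```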

Recommended exercise 10

PCA - Summary

  • Principal component analysis (PCA) is a dimensionality reduction technique

  • Useful because:
    • Our ability to visualize data is limited to 2 or 3 dimensions.

    • Lower dimension can reduce the computational time of numerical algorithms.

    • Many statistical models suffer from high correlation between covariates; the PCs are uncorrelated by construction.

Principal Components Regression (PCR)

  • Principal Components Regression involves:
    • Constructing the first \(M\) principal components \(\boldsymbol Z_1, ..., \boldsymbol Z_M\)
    • Using these components as the predictors in a standard linear regression model

  • Key assumptions: A small number of principal components suffice to explain:

    1. Most of the variability in the data.
    2. The relationship with the response.

  • The assumptions above are not guaranteed to hold in every case.
    • This is especially true for assumption \(2\) above,
    • since the PCs are selected via unsupervised learning (without using the response).
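
A sketch of PCR with scikit-learn (the simulated data, the grid of \(M\) values, and 5-fold cross-validation are our own choices; this is one common way to implement the idea): construct the components with PCA inside a pipeline, fit least squares on them, and pick \(M\) by cross-validated error.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X[:, :5] @ rng.normal(size=5) + rng.normal(size=n)   # only 5 predictors carry signal

scores = {}
for M in range(1, 11):
    pcr = Pipeline([
        ("scale", StandardScaler()),          # standardize before PCA
        ("pca", PCA(n_components=M)),         # first M principal components
        ("ols", LinearRegression()),          # least squares on the components
    ])
    scores[M] = -cross_val_score(pcr, X, y, cv=5,
                                 scoring="neg_mean_squared_error").mean()

best_M = min(scores, key=scores.get)
print("cross-validated MSE by M:", scores)
print("chosen M:", best_M)
```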

Example: PCR vs. Lasso and Ridge (Simulated data)

  • PCR performed well on the simulated data, recovering the need for \(M=5\) components.
    • However, the results are only slightly better than the lasso and very similar to ridge.

  • Similar to ridge, PCR does not perform feature selection.
    • The PCs are linear combinations of all \(p\) predictors.

  • PCR can be seen as a discretized version of ridge regression.
    • Ridge shrinks the coefficient of each PC by a factor \(\lambda_j^2/(\lambda_j^2 + \lambda)\)
    • Stronger shrinkage is applied to the less important PCs.
    • PCR instead discards the \(p - M\) components with the smallest eigenvalues.
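
A small numpy illustration of this comparison, under the assumption that the \(\lambda_j\) above denote the singular values of the centered design matrix (the toy data and the value of the ridge penalty are ours): ridge multiplies the contribution of the \(j\)-th PC direction by \(\lambda_j^2/(\lambda_j^2+\lambda)\), whereas PCR keeps the first \(M\) directions unchanged and drops the rest.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 100, 6, 3
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))
Xc = X - X.mean(axis=0)

sv = np.linalg.svd(Xc, compute_uv=False)   # singular values, in decreasing order
ridge_lambda = 50.0                        # illustrative penalty value

ridge_factor = sv**2 / (sv**2 + ridge_lambda)      # smooth shrinkage, strongest on small singular values
pcr_factor = (np.arange(p) < M).astype(float)      # keep the first M components, discard the rest

print(np.round(ridge_factor, 3))
print(pcr_factor)
```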

Example: Shrinkage Factor

Example: PCR (Credit Data)

Recommended exercise 11

PCR (Drawback)

  • Dimensionality reduction is done via an unsupervised method (PCA)
    • No guarantee that the directions that best explain the predictors will also be the best directions to use for predicting the response.

Partial Least Squares (PLS)

  • PLS works similarly to PCR:
    • Dimension reduction: \(Z_1, ..., Z_M\), \(M < p\)
    • Each \(Z_m\) is a linear combination of the original predictors.
    • Apply a standard linear model using \(Z_1, ..., Z_M\) as predictors.

  • But it uses the response \(Y\) in order to identify new features
    • attempts to find directions that help explain both the response and the predictors.
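
A brief scikit-learn sketch of this difference (the simulated data and component counts are ours): PCA builds its directions from \(X\) alone, while PLSRegression must be given the response \(y\) as well.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

# PCA looks only at X ...
Z_pca = PCA(n_components=2).fit_transform(X)

# ... whereas PLS needs the response as well to construct its directions.
pls = PLSRegression(n_components=2).fit(X, y)
Z_pls = pls.transform(X)
print(Z_pca.shape, Z_pls.shape)
```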

Partial Least Squares (Algorithm)

  • \(Z_1 = \sum _{j=1}^p \phi_{j1} X_j\)
    • \(\phi_{j1}\) is the coefficient from the simple linear regression of \(Y\) onto \(X_j\).
    • This coefficient is proportional to the correlation between \(Y\) and \(X_j\).
    • PLS puts the highest weight on the variables that are most strongly related to the response.

  • To obtain the second PLS direction, \(Z_2\):
    • We regress each variable on \(Z_1\) and take the residuals.
    • The residuals are the information not explained by \(Z_1\).
    • We then compute \(Z_2\) from this orthogonalized data, in the same way as \(Z_1\).

  • We can repeat this process \(M\) times to obtain \(Z_1, ..., Z_M\).
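
A minimal numpy sketch of the algorithm as described above (the toy data are our own; this follows the textbook description rather than any optimized library routine):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, M = 100, 6, 2
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

# Standardize the predictors and center the response.
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = y - y.mean()

Z = np.empty((n, M))
Xk = X.copy()                          # working (deflated) copy of the predictors
for m in range(M):
    # phi_j = coefficient of the simple regression of y onto the j-th column
    phi = Xk.T @ y / np.sum(Xk**2, axis=0)
    Z[:, m] = Xk @ phi                 # m-th PLS direction
    # Deflate: regress each column on Z_m and keep the residuals
    proj = np.outer(Z[:, m], Z[:, m] @ Xk) / (Z[:, m] @ Z[:, m])
    Xk = Xk - proj

# Fit ordinary least squares on the PLS scores
theta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), Z]), y, rcond=None)
print(theta)
```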

Recommended exercise 12

Partial Least Squares (Performance)

  • In practice, PLS often performs no better than ridge regression or PCR.
    • The supervised dimension reduction of PLS can reduce bias.
    • However, it also has the potential to increase variance.

In summary

  • PLS, PCR and ridge regression tend to behave similarly.

  • Ridge regression may be preferred because it shrinks smoothly, rather than in discrete steps.

  • Lasso falls somewhere between ridge regression and best subset regression, and enjoys some of the properties of each.

Considerations in high dimensions

High dimension

  • High-dimensional problems: \(p > n\) (more predictors than observations)

  • Increasingly common nowadays

High dimension issues (Example)

  • Standard least squares regression cannot be used in this setting.
    • It gives a perfect fit to the training data, regardless of the true relationship.
    • Unfortunately, the \(C_p\), AIC, and BIC approaches are also problematic (it is hard to estimate \(\sigma^2\)).
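
A small simulation illustrating the perfect-fit problem (the sample size and the pure-noise design are our own choices): once the number of fitted parameters reaches the number of observations, least squares drives the training error to essentially zero even though the predictors are unrelated to the response, while the error on new data is large.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
y = rng.normal(size=n)                 # response unrelated to any predictor
X = rng.normal(size=(n, n - 1))        # n - 1 pure-noise predictors

# With an intercept this gives n parameters for n observations: a perfect fit.
D = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(D, y, rcond=None)
print("training RSS:", np.sum((y - D @ beta) ** 2))   # essentially 0

# The same coefficients predict new, independent data very poorly.
y_new = rng.normal(size=n)
X_new = rng.normal(size=(n, n - 1))
D_new = np.column_stack([np.ones(n), X_new])
print("test RSS:", np.sum((y_new - D_new @ beta) ** 2))
```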

Noise predictors

  • The test error tends to increase as the dimensionality of the problem grows,
    • Unless the additional features are truly associated with the response.

The danger of too many features

  • In general, adding signal features helps (it yields smaller test set errors).

  • However, adding noise features that are not truly associated with the response increases the test set error.
    • Noise features exacerbate the risk of overfitting.
    • The previous example shows that regularization does not eliminate the problem.

  • New technologies that allow for the collection of measurements for thousands or millions of features are a double-edged sword

Interpreting results in high dimension

  • In the high-dimensional setting, the multicollinearity problem is extreme

  • Essentially, this means:
    • We can never know exactly which variables (if any) truly are predictive of the outcome.
    • We can never identify the best coefficients for use in the regression.

    • At most, we can hope to assign large regression coefficients to variables that are correlated with the variables that truly are predictive of the outcome.
    • We will find one of possibly many suitable predictive models.

The end



Thank you for showing up