Spring 2018
We have the predictors \(\boldsymbol x_1, \boldsymbol x_2, ..., \boldsymbol x_n\) associated with the responses \(y_1, y_2, ..., y_n\).
We assume there is a function \(f\) such that \[ y_i = f(\boldsymbol x_i) + \epsilon\] where the noise \(\epsilon\) has zero mean and variance \(\sigma ^2\)
We want to find a model \(\hat{f}\) such that the errors \((y_i - \hat{y}_i)\) are small for both in-sample and out-of-sample \(\boldsymbol x_i\), where \(\hat{y}_i = \hat{f}(\boldsymbol x_i)\)
Show that
\[ E[(y_i - \hat{y}_i)^2] = \text{Bias}[\hat{y}_i]^2 + \text{Var}[\hat{y}_i] + \sigma^2 \]
assuming the notation used in the previous two slides.
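One way to see this (a sketch, assuming \(\hat{y}_i\) is computed from the training data and is therefore independent of the new noise term \(\epsilon\), which has mean zero and variance \(\sigma^2\)):
\[ \begin{aligned} E[(y_i - \hat{y}_i)^2] &= E[(f(\boldsymbol x_i) + \epsilon - \hat{y}_i)^2] \\ &= E[(f(\boldsymbol x_i) - \hat{y}_i)^2] + 2E[\epsilon]\,E[f(\boldsymbol x_i) - \hat{y}_i] + E[\epsilon^2] \\ &= \left(f(\boldsymbol x_i) - E[\hat{y}_i]\right)^2 + E\!\left[(\hat{y}_i - E[\hat{y}_i])^2\right] + \sigma^2 \\ &= \text{Bias}[\hat{y}_i]^2 + \text{Var}[\hat{y}_i] + \sigma^2 \end{aligned} \]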
\[ RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]
\[ MSE = \frac{RSS}{n} = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n}\]
\[ R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}\]
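As a quick illustration (a minimal sketch on the built-in mtcars data; the model choice is arbitrary), these quantities can be computed directly from a fitted lm object:

```r
# Fit a simple linear model on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

y     <- mtcars$mpg
y_hat <- fitted(fit)

rss <- sum((y - y_hat)^2)      # residual sum of squares
mse <- rss / length(y)         # training MSE = RSS / n
tss <- sum((y - mean(y))^2)    # total sum of squares
r2  <- 1 - rss / tss           # R^2

c(RSS = rss, MSE = mse, R2 = r2)
summary(fit)$r.squared         # should agree with r2
```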
Training error: Average loss over the training sample \[ \text{MSE}_\text{train} = \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{n} \]
Test error: given the training sample \(\mathcal{T}\), the expectation is taken wrt \(P(\boldsymbol X, Y)\) for a new observation \((\boldsymbol X, Y)\) \[ \text{MSE}_\text{test} = E[(Y - \hat{f}(\boldsymbol X))^2|\mathcal{T}] \]
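A minimal sketch of the distinction on simulated data (all names are illustrative): the model is fit on a training sample and the average squared loss is computed both on that sample and on an independent test sample drawn from the same \(P(\boldsymbol X, Y)\):

```r
set.seed(1)

# Simulate training and test samples from the same model y = f(x) + eps
n       <- 100
f       <- function(x) 2 + 3 * x
x_train <- runif(n); y_train <- f(x_train) + rnorm(n)
x_test  <- runif(n); y_test  <- f(x_test)  + rnorm(n)

fit <- lm(y_train ~ x_train)

mse_train <- mean((y_train - fitted(fit))^2)
mse_test  <- mean((y_test - predict(fit, newdata = data.frame(x_train = x_test)))^2)

c(train = mse_train, test = mse_test)   # the test MSE is typically the larger one
```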
\[ Y = \beta_0 + \beta_1X_1 + ... + \beta_pX_p + \epsilon \] or in matrix form:
\[ \boldsymbol{Y} = \boldsymbol{X} \boldsymbol \beta + \boldsymbol{\epsilon} \]
\[ RSS = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n (y_i - \boldsymbol x_i^T \hat{\boldsymbol \beta})^2 = (\boldsymbol Y - \boldsymbol X \hat{\boldsymbol \beta})^T(\boldsymbol Y - \boldsymbol X \hat{\boldsymbol \beta})\]
\[ \hat{\boldsymbol \beta} =(\boldsymbol X^T \boldsymbol X)^{-1} \boldsymbol X^T \boldsymbol Y\]
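A minimal sketch checking this closed-form solution against lm() (mtcars is used as an arbitrary example data set):

```r
# Design matrix with an intercept column, and the response vector
X <- model.matrix(mpg ~ wt + hp, data = mtcars)
y <- mtcars$mpg

# Normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat <- drop(solve(t(X) %*% X, t(X) %*% y))

cbind(closed_form = beta_hat, lm = coef(lm(mpg ~ wt + hp, data = mtcars)))
```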
Write R code to recreate a plot similar to the Credit data figure on the previous slide.
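One possible starting point (a sketch that assumes the figure is the scatterplot matrix of the Credit data from the ISLR package; the chosen variables are a guess):

```r
library(ISLR)   # provides the Credit data set

# Scatterplot matrix of some quantitative variables in Credit
pairs(Credit[, c("Balance", "Age", "Cards", "Education", "Income", "Limit", "Rating")])
```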
Improve the prediction accuracy and/or model interpretability of linear models by replacing least squares fitting with alternative fitting procedures.
… when using standard linear models
Assuming true relationship is approx. linear: low bias.
By constraining or shrinking the estimated coefficients:
Show why fitting a standard linear regression model when \(n < p\) is not an option.
Some or many of the variables might be irrelevant wrt the response variable
Some of the approaches discussed here automatically perform feature/variable selection.
We will cover the following alternatives to using least squares to fit linear models
Identifying a subset of the \(p\) predictors that we believe to be related to the response.
Outline:
Number of models considered:
\[{{p}\choose{0}} + {{p}\choose{1}} + ... + {{p}\choose{p}} = 2^p \]
Best subset selection can be performed with the regsubsets() function of the leaps library, similar to what was done in Lab 1 of the book.
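A minimal sketch of best subset selection on the Credit data (the response, predictors and nvmax are illustrative choices):

```r
library(leaps)
library(ISLR)

# Best subset selection for Balance, considering models with up to 11 predictors
fit_best <- regsubsets(Balance ~ . - ID, data = Credit, nvmax = 11)
summary(fit_best)$which    # which predictors enter the best model of each size
summary(fit_best)$adjr2    # adjusted R^2 of the best model of each size
```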
Add and/or remove one predictor at a time.
Methods outline:
Similarly to forward selection, variables are added to the model sequentially.
However, after adding each new variable, the method may also remove any variables that no longer provide an improvement in the model fit.
Better model space exploration while retaining the computational advantages of stepwise selection.
Stepwise selection can also be performed with the regsubsets() function of the leaps library.
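Forward and backward stepwise selection are chosen through the method argument of regsubsets(); a minimal sketch, again on the Credit data:

```r
library(leaps)
library(ISLR)

fit_fwd <- regsubsets(Balance ~ . - ID, data = Credit, nvmax = 11, method = "forward")
fit_bwd <- regsubsets(Balance ~ . - ID, data = Credit, nvmax = 11, method = "backward")

coef(fit_fwd, 4)   # coefficients of the best 4-variable model found by forward selection
```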
The ridge regression coefficients \(\beta^R\) are the ones that minimize
\[RSS + \lambda \sum _{j=1}^p \beta_j^2\]
with \(\lambda > 0\) being a tuning parameter.
Ridge coefficient estimates depend on the scaling of the predictors, so it is best to standardize them first: \[ \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum _{i=1}^{n}(x_{ij} - \bar{x}_j)^2}} \]
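A minimal sketch of ridge regression on the Credit data with the glmnet package (the data set and the \(\lambda\) grid are illustrative; glmnet standardizes the predictors by default, in line with the formula above):

```r
library(glmnet)
library(ISLR)

# glmnet expects a numeric design matrix (without intercept column) and a response vector
x <- model.matrix(Balance ~ . - ID, data = Credit)[, -1]
y <- Credit$Balance

grid <- 10^seq(4, -2, length = 100)                   # grid of lambda values
fit_ridge <- glmnet(x, y, alpha = 0, lambda = grid)   # alpha = 0 gives the ridge penalty

plot(fit_ridge, xvar = "lambda")   # coefficient paths as a function of log(lambda)
```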
The lasso coefficients \(\beta^L\) are the ones that minimize
\[RSS + \lambda \sum _{j=1}^p |\beta_j|\]
with \(\lambda > 0\) being a tuning parameter.
Lasso also shrinks the coefficients towards zero
In addition, the \(\ell_1\) penalty has the effect of forcing some of the coefficients to be exactly zero when \(\lambda\) is large enough
A geometric explanation will be presented in a future slide.
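A minimal sketch of the lasso with glmnet (alpha = 1 selects the \(\ell_1\) penalty; the data set and the \(\lambda\) value are illustrative). For large enough \(\lambda\) several coefficients are exactly zero:

```r
library(glmnet)
library(ISLR)

x <- model.matrix(Balance ~ . - ID, data = Credit)[, -1]
y <- Credit$Balance

fit_lasso <- glmnet(x, y, alpha = 1)   # lasso penalty

plot(fit_lasso, xvar = "lambda")       # some coefficient paths hit exactly zero
coef(fit_lasso, s = 50)                # sparse coefficient vector at lambda = 50
```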
\[ \underset{\beta}{\text{minimize}} \left\{\sum _{i=1}^n \left (y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij} \right)^2 \right\}\text{ subject to }\sum_{j=1}^p |\beta_j| \le s\]
\[ \underset{\beta}{\text{minimize}} \left\{\sum _{i=1}^n \left (y_i - \beta_0 - \sum_{j=1}^p \beta_jx_{ij} \right)^2 \right\}\text{ subject to }\sum_{j=1}^p \beta_j^2 \le s\]
Neither is universally better than the other
One expects the lasso to perform better when a relatively small number of predictors have substantial coefficients and the remaining coefficients are very small or zero
One expects ridge to be better when the response is a function of many predictors, all with coefficients of roughly equal size
Hard to know a priori; techniques such as cross-validation are required
Pick \(\lambda\) for which the cross-validation error is smallest.
Re-fit using all of the available observations and the selected value of \(\lambda\).
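A minimal sketch of this procedure with cv.glmnet() (10-fold cross-validation by default), again using the Credit data as an illustrative example:

```r
library(glmnet)
library(ISLR)

x <- model.matrix(Balance ~ . - ID, data = Credit)[, -1]
y <- Credit$Balance

set.seed(1)
cv_lasso    <- cv.glmnet(x, y, alpha = 1)   # cross-validate the lasso over a lambda grid
best_lambda <- cv_lasso$lambda.min          # lambda with the smallest CV error

plot(cv_lasso)   # CV error as a function of log(lambda)

# Re-fit on all observations with the selected lambda
fit_final <- glmnet(x, y, alpha = 1, lambda = best_lambda)
coef(fit_final)
```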
Throughout the recommended exercises, you have applied the following techniques to the Credit dataset:
Which method worked best for this particular dataset? Elaborate.