(08.04: minor adjustments)

Overview

course content and learning outcome
reading list
overview of modules and core course topics (with exam type questions)
exam: details on the digital exam, different types of exam questions and exam preparation
suggestions for statistics-related courses

Some of the figures in this presentation are taken from (or are inspired by) “An Introduction to Statistical Learning, with applications in R” (Springer, 2013) with permission from the authors: G. James, D. Witten, T. Hastie and R. Tibshirani.

Added after class

Class notes M12L1notes.pdf
Mette's yellow sheet - there ARE misprints - tell Mette when you find a misprint: SummingUp2019.pdf

Course content

Statistical learning, multiple linear regression, classification, resampling methods, model selection/regularization, non-linearity, support vector machines, tree-based methods, unsupervised methods, neural nets.

Learning outcome

Knowledge. The student has knowledge about the most popular statistical learning models and methods that are used for prediction and inference in science and technology. Emphasis is on regression- and classification-type statistical models.
Skills. The student knows, based on an existing data set, how to choose a suitable statistical model, apply sound statistical methods, and perform the analyses using statistical software. The student knows how to present the results from the statistical analyses, and which conclusions can be drawn from the analyses.

And you got to be an expert in using the R language and writing R Markdown reports?

Final reading list

Textbook: James, Witten, Hastie, Tibshirani (2013): “An Introduction to Statistical Learning”.

the whole textbook (436 pages)
the 12 module pages (remark: module 11 not in the book, and module 1+12 have no “new” material)
the compulsory exercises
- Compulsory1 Short solutions
- Compulsory2 Short solutions

The short solutions will be available in April and May.

Introduction
Statistical learning*
Linear regression*
Classification*
Resampling methods
Linear model selection and regularization (several files, links below)
Moving beyond linearity*
Tree-based methods
Support vector machines
Unsupervised learning (several files, links below)
Neural networks*
Summing up

Remark: * means that some material is added as compared to the textbook

Core of the course

build toolbox: how to analyse data (that are not too complex)

supervised and unsupervised learning
supervised: regression and classification
- examples of regression and classification type problems
- how complex a model to get the best fit? flexibility/overfitting/underfitting.
- the bias-variance trade-off
- how to find the perfect fit - validation and cross-validation (or AIC-type solutions)
- how to compare different solutions
- how to evaluate the fit - on new unseen data
unsupervised: how to find structure or groupings in data?

and of course all the methods (with underlying models) to perform regression, classification and unsupervised learning. We have gained some theoretical understanding, but in some cases deeper theoretical background and understanding of the models is provided in other statistics courses.

The modules

Here we list topics and possible exam-type questions/problems. In addition, the recommended exercises are useful to work on - and also the exercises in the textbook.

1. Introduction

Topics in Module 1

Examples, the modules, required background in statistics and introduction to R

2. Statistical learning

and solutions to RecEx

Topics in Module 2

Model complexity
- Prediction vs. interpretation.
- Parametric vs. nonparametric.
- Inflexible vs. flexible.
- Overfitting vs. underfitting
Supervised vs. unsupervised.
Regression and classification.
Loss functions: quadratic and 0/1 loss.
Bias-variance trade-off (polynomial example): mean squared error, training and test set.
Classification: the Bayes and KNN-classifier
Vectors and matrices, rules for mean and covariances, the multivariate normal distribution.
Model complexity and the bias-variance trade-off are important in “all” subsequent modules.
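
As a reminder of what the bias-variance trade-off refers to, the decomposition for quadratic loss at a fixed input \({\bf x}_0\) can be sketched as follows. The notation is an assumption borrowed from the textbook: \(Y_0=f({\bf x}_0)+\varepsilon\) with \(\text{E}(\varepsilon)=0\), \(\text{Var}(\varepsilon)=\sigma^2\), and \(\varepsilon\) independent of the fitted \(\hat{f}\), so that the cross term vanishes.

\[
\begin{aligned}
\text{E}[(Y_0-\hat{f}({\bf x}_0))^2]
&= \text{E}[(f({\bf x}_0)-\hat{f}({\bf x}_0))^2] + 2\,\text{E}[(f({\bf x}_0)-\hat{f}({\bf x}_0))\,\varepsilon] + \text{E}[\varepsilon^2] \\
&= \text{E}[(f({\bf x}_0)-\hat{f}({\bf x}_0))^2] + \sigma^2 \\
&= \underbrace{\sigma^2}_{\text{irreducible error}} + \underbrace{\text{Var}(\hat{f}({\bf x}_0))}_{\text{variance}} + \underbrace{\left(f({\bf x}_0)-\text{E}[\hat{f}({\bf x}_0)]\right)^2}_{\text{squared bias}}
\end{aligned}
\]

The expectation is taken over both the training data (through \(\hat{f}\)) and the new response \(Y_0\).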

Questions/Problems:

Exam in 2018: Problem 2
Compulsory1 in 2019: Problem 3.
Compulsory1 in 2018: Problem 1.
What are the differences between a supervised and an unsupervised method? List one method of each type and explain briefly which problem they can solve.
What are the two main types of supervised methods discussed in this course, and how do they differ? List two methods of each type and explain briefly how the two methods are related.

3. Linear regression

and solutions to RecEx

Topics in Module 3

Examples: Munich rent index, ozone, SLID, Framingham heart disease, Boston housing prices, auto.
The classical normal linear regression model on vector/matrix form.
Parameter estimators and distribution thereof. Model fit.
Confidence intervals, hypothesis tests, and interpreting R-output from regression.
Qualitative covariates, interactions.
This module is a stepping stone for all subsequent uses of regression in Modules 6, 7, 8, and 11.

Questions/Problems:

Compulsory1 in 2019: Problem 1.
Compulsory1 in 2018: Problem 2.
Theoretical questions are referred to TMA4267 Linear statistical models, but basic knowledge and interpretation of lm output is important.
Write down the classical normal multiple regression model in vector and matrix notation. Specify dimensions and explain your notation. Also write down the estimator for the regression coefficients. What is the distribution of this estimator?
Output from summary.lm presented - maybe with question marks in place of numbers - and you explain and calculate. Interpret the two top residual plots!
What is the “bias-variance decomposition”? Is it applicable to all choices of loss functions? Write down the derivation for quadratic loss for a regression problem at \({\bf x}_0\). Explain your notation.

## 
## Call:
## lm(formula = -1/sqrt(SYSBP) ~ ., data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0207366 -0.0039157 -0.0000304  0.0038293  0.0189747 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.103e-01  1.383e-03 -79.745  < 2e-16 ***
## SEX         -2.989e-04  2.390e-04  -1.251 0.211176    
## AGE          2.378e-04  1.434e-05  16.586  < 2e-16 ***
## CURSMOKE    -2.504e-04  2.527e-04  -0.991 0.321723    
## BMI          3.087e-04  2.955e-05  10.447  < 2e-16 ***
## TOTCHOL      9.288e-06  2.602e-06   3.569 0.000365 ***
## BPMEDS       5.469e-03  3.265e-04  16.748  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005819 on 2593 degrees of freedom
## Multiple R-squared:  0.2494, Adjusted R-squared:  0.2476 
## F-statistic: 143.6 on 6 and 2593 DF,  p-value: < 2.2e-16
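
To see how the entries of such a print-out hang together, the following sketch recomputes a few of them by hand. The variable names are illustrative, and the inputs are the rounded numbers from the print-out above, so the results agree with the print-out only up to rounding.

```r
# Numbers copied from the print-out (AGE row and the summary lines)
est_age <- 2.378e-04
se_age  <- 1.434e-05
df_res  <- 2593        # n - p - 1, with p = 6 covariates
r2      <- 0.2494
p       <- 6

# t value = estimate / standard error; two-sided p-value from the t distribution
t_age <- est_age / se_age
p_age <- 2 * pt(abs(t_age), df = df_res, lower.tail = FALSE)

# 95% confidence interval for the AGE coefficient
ci_age <- est_age + c(-1, 1) * qt(0.975, df = df_res) * se_age

# Overall F-statistic expressed through R^2
f_stat <- (r2 / p) / ((1 - r2) / df_res)

print(c(t_value = t_age, F_statistic = f_stat))
print(ci_age)
```

The same recipe applies to any row: a missing t value is the estimate divided by its standard error, and a missing estimate is the t value times the standard error.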

2018: Compulsory 2, Problem 1

4. Classification

and solutions to RecEx (Mainly discussed the two-class problem in this course)

Topics in Module 4

Examples: South African heart disease, wine, German credit data, IMDB movie review, MNIST digit classification, iris plants.
Bayes classifier: classifying to the most probable class minimizes the expected 0/1 loss. We usually do not know the probability of each class for each input. The Bayes optimal boundary is the boundary for the Bayes classifier and the error rate (on a test set) for the Bayes classifier is the Bayes error rate. Related to the irreducible error (but bias-variance decomposition is for quadratic loss).

Two paradigms (not in textbook):
- diagnostic (directly estimating the posterior distribution for the classes)
- sampling (estimating class prior probabilities and class conditional distribution and then putting together with Bayes rule)
LDA and QDA: sampling paradigm. Multivariate normal class densities with common covariance (LDA) or class specific covariance (QDA). Class boundaries will be linear (LDA) or quadratic (QDA). Easily handle more than two classes.
KNN: diagnostic paradigm. Formula for posterior class probability. Overfitting/underfitting and flexibility of class boundary as a function of \(K\). Non-linear class boundaries. Easily handles more than two classes.

Logistic regression: mainly used for classification (even though the name is regression). Diagnostic paradigm. Logistic (sigmoid) function and linear predictor. Interpretation of regression coefficients using odds. Linear class boundaries. Two classes.
Evaluation with confusion matrix, understanding of ROC-curve and AUC.
Theory: both LDA and logistic regression.
This module is a stepping stone module for subsequent use of classification in Modules 8, 9 and 11.

Figure caption: The link function plotted against the probability using the Beetle Mortality dataset from R with parameters \(\beta_0 = -60.7\) and \(\beta_1=34.3\). The original values of \(\eta\) and \(\pi\) are included (\(\eta = 0, \, \pi = 0.5\)), as well as the cases in which the odds increase (\(x_{i1} \rightarrow x_{i1} +1, \, \beta_1 > 0, \, \eta=1, \, \pi=0.731\)), and decrease (\(x_{i1} \rightarrow x_{i1} -1, \, \beta_1 < 0, \, \eta=-1, \, \pi=0.269\)). Figure made by Dag Johnsrud Kristiansen.

Questions/Problems:

Exam in 2018: Problem 3 and 4
Compulsory1 in 2019: Problem 2
Compulsory2 in 2019: Problem 3
Compulsory1 in 2018: Problem 3
Compulsory3 in 2018: Problem 2a

Which are the two paradigms we have presented for classification? Explain briefly how these two differ and identify which of the classification methods that we have discussed in this course belongs to which paradigm. Describe one method from each paradigm briefly.
Assume we have two classes (class 1 and class 2) and a bivariate input variable (covariate) \({\bf x}\). We now assume that each class conditional distribution is bivariate normal with the same covariance matrix for the two classes. Write down the posterior probability for class 1 (explaining all the parameters that are involved). Then show (yes, derive the result) that the class boundary between class 1 and 2 is linear in the two components of \({\bf x}\). What is the name of this classification method? (This was very similar to the exam question in Problem 3 of the 2018 exam.)
Given parameter estimates for class means and common covariance matrix (numerical values), use LDA to predict the class of a new observation. (The pdf of the multivariate normal distribution will then be given.)
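
A sketch of the linear-boundary argument asked for above, under the stated two-class assumptions. The symbols are assumptions not fixed by the question: prior probabilities \(\pi_1,\pi_2\), class means \(\boldsymbol\mu_1,\boldsymbol\mu_2\), common covariance matrix \(\boldsymbol\Sigma\) and class densities \(f_k\) (the \(\pi\) in \(2\pi\) below is the number, not a prior).

\[
P(Y=1\mid {\bf x})=\frac{\pi_1 f_1({\bf x})}{\pi_1 f_1({\bf x})+\pi_2 f_2({\bf x})},
\qquad
f_k({\bf x})=\frac{1}{2\pi|\boldsymbol\Sigma|^{1/2}}\exp\left(-\tfrac{1}{2}({\bf x}-\boldsymbol\mu_k)^T\boldsymbol\Sigma^{-1}({\bf x}-\boldsymbol\mu_k)\right)
\]

The boundary is where the two posteriors are equal, that is \(\log(\pi_1 f_1({\bf x}))=\log(\pi_2 f_2({\bf x}))\). Expanding the quadratic forms, the term \({\bf x}^T\boldsymbol\Sigma^{-1}{\bf x}\) and the normalizing constant are the same for both classes and cancel, leaving

\[
{\bf x}^T\boldsymbol\Sigma^{-1}(\boldsymbol\mu_1-\boldsymbol\mu_2)
-\tfrac{1}{2}\left(\boldsymbol\mu_1^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_1-\boldsymbol\mu_2^T\boldsymbol\Sigma^{-1}\boldsymbol\mu_2\right)
+\log\frac{\pi_1}{\pi_2}=0,
\]

which is linear in the components of \({\bf x}\). With class-specific covariance matrices the quadratic term does not cancel, which gives the quadratic boundary of QDA.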

Logistic regression is a classification method for two classes, where the classes are coded \(0\) and \(1\). Assume we have fitted a logistic regression to a data set with covariates \(x_1\) and \(x_2\) and that the fitted model is written \[ \hat{p} = \frac{\exp(1+2\cdot x_1+3\cdot x_2)}{1+\exp(1+2\cdot x_1+3\cdot x_2)}\] What is the interpretation of \(\hat{p}\) (left side) here? What is the interpretation of the regression coefficient \(\hat{\beta}_1=2\)?
What is a confusion matrix? What is it used for? How is the misclassification rate defined?
We have actively used the receiver operating characteristic (ROC) curve and the area under the curve (AUC) in this course. In which types of problems are these used? Explain how a ROC-curve is constructed. If a method gives an AUC of 0.5 when used on a data set, what can you say about this method?
Output from fitting a method is presented - you explain and evaluate output, evaluate classification boundaries, interpret ROC-curve and compare methods.

## 
## Call:
## glm(formula = chd ~ ., family = "binomial", data = heartds)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7781  -0.8213  -0.4387   0.8889   2.5435  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.1507209  1.3082600  -4.701 2.58e-06 ***
## sbp             0.0065040  0.0057304   1.135 0.256374    
## tobacco         0.0793764  0.0266028   2.984 0.002847 ** 
## ldl             0.1739239  0.0596617   2.915 0.003555 ** 
## adiposity       0.0185866  0.0292894   0.635 0.525700    
## famhistPresent  0.9253704  0.2278940   4.061 4.90e-05 ***
## typea           0.0395950  0.0123202   3.214 0.001310 ** 
## obesity        -0.0629099  0.0442477  -1.422 0.155095    
## alcohol         0.0001217  0.0044832   0.027 0.978350    
## age             0.0452253  0.0121298   3.728 0.000193 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 596.11  on 461  degrees of freedom
## Residual deviance: 472.14  on 452  degrees of freedom
## AIC: 492.14
## 
## Number of Fisher Scoring iterations: 5
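
As a hedged illustration of interpreting logistic regression coefficients using odds, the sketch below takes the famhistPresent row of the print-out above and turns it into an odds ratio with a Wald-type interval. Variable names are illustrative, and the inputs are the printed (rounded) values.

```r
# famhistPresent row copied from the print-out
beta_famhist <- 0.9253704
se_famhist   <- 0.2278940

# exp(beta) is the multiplicative change in the odds of chd when
# famhist goes from Absent to Present, other covariates held fixed
odds_ratio <- exp(beta_famhist)

# z value and two-sided p-value (Wald test, standard normal reference)
z_famhist <- beta_famhist / se_famhist
p_famhist <- 2 * pnorm(abs(z_famhist), lower.tail = FALSE)

# 95% Wald interval on the log-odds scale, transformed to the odds scale
ci_odds_ratio <- exp(beta_famhist + c(-1, 1) * qnorm(0.975) * se_famhist)

print(c(odds_ratio = odds_ratio, z = z_famhist, p = p_famhist))
print(ci_odds_ratio)
```

For a continuous covariate the same calculation gives the multiplicative change in the odds per one-unit increase in that covariate.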

The bias-variance trade-off in the classification setting?

Bias-variance trade-off is for quadratic loss.
Generalizations exist - but not covered in this course.
For classification we tend to think of the Bayes error rate as some kind of lowest possible error rate - similar to the irreducible error.
In classification we are also focussed on over/under-fitting, and refer to a method that fits the classification boundary closely as having small bias.

5. Resampling methods

and (handwritten) solutions to RecEx

Topics in Module 5

Data rich situation: Training-validation and test set.
Validation set approach
How is cross-validation performed? For regression and for classification.
LOOCV, 5 and 10 fold CV
good and bad issues with validation set, LOOCV, 10-fold CV
bias and variance for k-fold cross-validation - end up with k=5 or k=10 fold as good balance?
selection bias - the right and wrong way to do cross-validation
bootstrapping to estimate uncertainty in statistic (warming up to Module 8)

Questions/Problems:

Exam in 2018: Problem 1
Compulsory 1 in 2019: Problem 2
Compulsory 2 in 2018: Problem 1cd
In a setting where you have access to unlimited amounts of data explain the role of the training set, validation set, and test set. Point to advantages/disadvantages of making such a division of the data set. Your answer should include the words: model complexity, tuning parameters, overfitting, model fit/parameters.
In a setting where you have access to a limited amount of data explain how \(k\)-fold cross-validation can be used for model assessment and model selection. A drawing might be useful.
Explain what is meant by cross-validation. Discuss its use in practice. How does cross-validation relate to the use of training/validation/test sets?
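
A minimal sketch of \(k\)-fold cross-validation used for model selection. The data set (mtcars, shipped with R), the model family (polynomials in hp) and the use of the polynomial degree as tuning parameter are all assumptions made for illustration, not taken from the course material.

```r
set.seed(1)
k <- 5
n <- nrow(mtcars)
# Randomly assign each observation to one of k folds of (almost) equal size
fold <- sample(rep(1:k, length.out = n))

cv_mse <- function(degree) {
  fold_mse <- sapply(1:k, function(j) {
    train <- mtcars[fold != j, ]   # fit on the other k - 1 folds
    held_out <- mtcars[fold == j, ]  # evaluate on the held-out fold
    fit <- lm(mpg ~ poly(hp, degree), data = train)
    mean((held_out$mpg - predict(fit, newdata = held_out))^2)
  })
  mean(fold_mse)                   # average over the k folds
}

cv_by_degree <- sapply(1:4, cv_mse)
best_degree <- which.min(cv_by_degree)
print(cv_by_degree)
print(best_degree)
```

Since the cross-validated error of the chosen degree was used for selection, it is an optimistic estimate of the test error. For model assessment a separate test set (or an outer cross-validation loop) is still needed.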

Explain how a bootstrap sample is drawn. What is the probability that an observation in our data set will be a part of a given bootstrap sample?
Assume that we want to fit a regression model. Explain how we can use bootstrapping to estimate the standard deviation of parameter estimates in our model.
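
The two bootstrap questions above can be explored with a short sketch. The data set (mtcars) and the simple regression of mpg on wt are assumptions chosen for illustration.

```r
set.seed(1)
n <- nrow(mtcars)

# Probability that a given observation is in a bootstrap sample of size n:
# one minus the probability of never being drawn in n draws with replacement.
# This tends to 1 - exp(-1), about 0.632, as n grows.
p_included <- 1 - (1 - 1 / n)^n

# Bootstrap the slope: resample rows with replacement, refit, keep the estimate
B <- 1000
boot_slope <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  coef(lm(mpg ~ wt, data = mtcars[idx, ]))["wt"]
})

# The standard deviation of the B estimates is the bootstrap standard error
se_boot <- sd(boot_slope)
se_lm <- summary(lm(mpg ~ wt, data = mtcars))$coefficients["wt", "Std. Error"]
print(c(p_included = p_included, se_boot = se_boot, se_lm = se_lm))
```

The bootstrap standard error can be compared with the model-based one from lm, which relies on the assumptions of the normal linear regression model.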

6. Linear model selection and regularization:

Lecture 1 and
Lecture 2
and solutions to RecEx

Topics in Module 6:

Model selection: estimate performance of different models to choose the best one.
Model assessment: having chosen a final model, estimate its performance on new data
Model selection by penalizing the training error: AIC, BIC, \(C_p\), Adjusted \(R^2\).
Cross-validation can be used for model selection and assessment.
Subset selection:
- best subset selection
- stepwise model selection

Shrinkage methods
- ridge regression: quadratic L2 penalty added to RSS
- lasso regression: absolute L1 penalty added to RSS
- no penalty on intercept, not scale invariant: center and scale covariates
Dimension reduction methods:
- principal component analysis: eigenvectors, proportion of variance explained, scree plot
- principal component regression
- partial least squares (lightly covered)
High dimensionality issues: multicollinearity, interpretation.
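
For reference, the two shrinkage criteria listed above can be written out as follows, assuming the textbook notation with \(n\) observations, \(p\) covariates and tuning parameter \(\lambda\ge 0\). The intercept \(\beta_0\) is not penalized.

\[
\hat{\boldsymbol\beta}^{\text{ridge}}=\arg\min_{\boldsymbol\beta}\left\{\sum_{i=1}^n\Big(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^p\beta_j^2\right\}
\]

\[
\hat{\boldsymbol\beta}^{\text{lasso}}=\arg\min_{\boldsymbol\beta}\left\{\sum_{i=1}^n\Big(y_i-\beta_0-\sum_{j=1}^p\beta_j x_{ij}\Big)^2+\lambda\sum_{j=1}^p|\beta_j|\right\}
\]

The first term is the RSS. For \(\lambda=0\) both give the least squares fit, and a larger \(\lambda\) increases bias and decreases variance, so \(\lambda\) is typically chosen by cross-validation.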

Questions/Problems:

Compulsory 2 in 2019: Problem 1
Compulsory 2 in 2018: Problem 1 and 2
Print-out from best subset selection, explain how this is done and what the best model is if you use BIC. Explain how you instead (of using BIC) can use cross-validation.
We have discussed parametric methods where the parameters are found by minimizing the sum of a loss function and a penalty. Choose one such method, write down the loss and penalty used, and explain how this is related to the bias-variance trade-off.
Interpret figures, explain what you see. What do we call this method?
Best subset and lasso: Exam TMA4267V2016 Problem 2d with solutions
Best subset and lasso: Exam TMA4267V2014 Problem 2c with solutions

ISL 6.7

ISL 6.4

ISL 6.6

Explain how you find the principal components for a given data set, and how these are used in regression. Assume you have \(p\) covariates and \(n\) observations (where \(n \gg p\)) and you fit a regression model with the first \(p\) principal components as regressors. How does this compare to fitting a multiple linear regression to the original covariates? What if you instead only use the first \(q\) principal components, where \(q<p\)?
MLR, overfitting and principal component regression: Exam TMA4267K2014 Problem 2 with solutions
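
The comparison between principal component regression and multiple linear regression can be checked numerically. The sketch below is an illustration on the mtcars data with four arbitrarily chosen covariates, both of which are assumptions and not part of the course material.

```r
# Standardized covariates and response
X <- scale(mtcars[, c("disp", "hp", "wt", "qsec")])
y <- mtcars$mpg

# Principal component scores: Z = X %*% rotation
pca <- prcomp(X)
Z <- pca$x

fit_mlr <- lm(y ~ X)             # regression on the original covariates
fit_pcr_all <- lm(y ~ Z)         # regression on all p principal components
fit_pcr_two <- lm(y ~ Z[, 1:2])  # regression on the first q = 2 components

# With all p components the fitted values coincide with those of the MLR
max_diff <- max(abs(unname(fitted(fit_mlr)) - unname(fitted(fit_pcr_all))))
print(max_diff)

# With q < p components the training RSS cannot be smaller than for the MLR
print(c(rss_mlr = deviance(fit_mlr), rss_pcr_two = deviance(fit_pcr_two)))
```

Using all \(p\) components is only a rotation of the covariates, so nothing is gained or lost. With \(q<p\) the fit is constrained, which trades some bias for lower variance.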

7. Moving beyond linearity

and solutions to RecEx

Topics in Module 7

Modifications to the multiple linear regression model - when a linear model is not the best choice. Similar techniques can be used for classification, but we only looked at regression. First look at one covariate, combine in “additive model”.
Basis functions: fixed functions of the covariates (no parameters to estimate)
Polynomial regression: multiple linear regression with polynomials as basis functions.
Step functions - piece-wise constants. Like our dummy variable coding of factors.
Regression splines: regional polynomials joined smoothly - neat use of basis functions. Cubic splines very popular.

Smoothing splines: smooth functions - minimizing the RSS with an additional penalty on the second derivative of the curve. Results in a natural cubic spline with knots in the unique values of the covariate. Complexity parameters chosen by AIC (with degrees of freedom) or cross-validation. (UiO mainly AIC, we mainly cross-validation.)
Local regressions: smoothed \(K\)-nearest neighbour with local regression and weighting. In applied areas loess is very popular.
Additive models: combine the above. Sum of (possibly) non-linear instead of linear functions.
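
The smoothing spline criterion mentioned above can be written as follows, assuming the textbook notation with a single covariate \(x\), a smooth function \(g\) and tuning parameter \(\lambda\ge 0\).

\[
\hat{g}=\arg\min_{g}\left\{\sum_{i=1}^n\big(y_i-g(x_i)\big)^2+\lambda\int g''(t)^2\,dt\right\}
\]

The first term is the RSS and the second penalizes roughness. As \(\lambda\to 0\) the fit interpolates the data, and as \(\lambda\to\infty\) it approaches the least squares line.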

Questions/Problems:

Compulsory 2 in 2019: Problem 1b
Compulsory 2 in 2018: Problem 3
What is the difference between a cubic spline and a natural cubic spline? What would you prefer?
A smoothing spline is a function minimizing the RSS and an additional penalty. What type of penalty is this? There is a tuning parameter involved - how can that be chosen? (Details on the smoother matrix and relationship to ridge is beyond the scope here.)
UiO 2017 Problem 1c with solutions - but we did not focus on degrees of freedom.

ISL 7.3

8. Tree-based methods

and solutions to RecEx

Topics in Module 8

Method applicable both to regression and classification (\(K\) classes) and will give non-linear covariate effects and include interactions between covariates. Based on binary splits of each covariate at a time.
Glossary: root, branches, internal nodes, terminal (leaf) nodes. Tree drawn upside down.
A tree can also be seen as a division of the covariate space into non-overlapping regions.
We build a tree from binary splits in one covariate at a time, chosen to improve some measure of error or impurity. The tree is created by not looking ahead - only at the current best split - thus a greedy strategy.
Criterion to minimize
- Regression: residual sums of squares
- Classification: Gini or cross entropy impurity measure or deviance

When to stop: a pre-specified stopping criterion - like minimal decrease in RSS or fewer than 10 observations in a terminal node.
Prediction in terminal nodes:
- Regression: \(\hat{y}=\frac{1}{N_j}\sum_{i: x_i \in R_j} y_i\)
- Classification: majority vote or fraction of each class in a node - and cut-off on probability.
Grow full tree, and then prune back using pruning strategy: cost complexity pruning = cost function + penalty times number of terminal nodes (not handled in detail).
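
The splitting criteria listed above can be sketched in a few lines. The function names are hypothetical, and the greedy search is shown for a single numeric covariate only.

```r
# Node impurity measures for a vector p of class proportions in a node
gini <- function(p) sum(p * (1 - p))
cross_entropy <- function(p) {
  p <- p[p > 0]                 # treat 0 * log(0) as 0
  -sum(p * log(p))
}

# Greedy choice of one binary split for regression:
# try all midpoints between sorted unique values, keep the one with lowest RSS
best_split <- function(x, y) {
  values <- sort(unique(x))
  cuts <- (head(values, -1) + tail(values, -1)) / 2
  rss <- sapply(cuts, function(s) {
    left <- y[x < s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  c(cut = cuts[which.min(rss)], rss = min(rss))
}

print(c(gini = gini(c(0.5, 0.5)), entropy = cross_entropy(c(0.5, 0.5))))
print(best_split(x = c(1, 2, 3, 10, 11, 12), y = c(1, 1, 1, 5, 5, 5)))
```

Both impurity measures are zero for a pure node and largest when the classes are evenly mixed. In a full tree the same search runs over all covariates, and is repeated within each new region.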

From one tree to many trees= forest. Why? To improve prediction (but this will give worse interpretation).
Bagging (bootstrap aggregation): draw \(B\) bootstrap samples and fit one full tree to each, use the average over all trees for prediction.
Random forest: as bagging but only \(m\) (randomly) chosen covariates (out of the \(p\)) are available for selection at each possible split. Rule of thumb for \(m\) is \(\sqrt{p}\) for classification and \(p/3\) for regression.
OOB: out-of-bag estimation can be used for model selection - no need for cross-validation.
Variable importance plots: give the total amount of decrease in RSS or Gini index over splits of a predictor - averaged over all B trees. May also be calculated over randomization of OOB.
Boosting: fit one tree with \(d\) splits, make residuals and fit a new tree, adjust residuals partly with new tree - repeat. Three tuning parameters chosen by cross-validation.
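
A minimal sketch of boosting for regression trees, following the description above. It uses the rpart package instead of the tree and gbm packages from the course material, which is an assumption made to keep the example short. The function name boost_trees and the mtcars data are also illustrative, and maxdepth is used as a stand-in for the number of splits \(d\) (the two agree only for stumps, \(d=1\)).

```r
library(rpart)  # recommended package that ships with standard R installations

boost_trees <- function(data, response, B = 100, lambda = 0.1, d = 1) {
  y <- data[[response]]
  covariates <- data[, setdiff(names(data), response), drop = FALSE]
  f_hat <- rep(0, length(y))  # start from the zero function
  r <- y                      # so the residuals equal the response
  for (b in seq_len(B)) {
    # fit a small tree to the current residuals
    fit <- rpart(r ~ ., data = cbind(covariates, r = r),
                 control = rpart.control(maxdepth = d, cp = 0))
    update <- lambda * predict(fit, newdata = covariates)
    f_hat <- f_hat + update   # add a shrunken version of the new tree
    r <- r - update           # and remove it from the residuals
  }
  f_hat
}

f_hat <- boost_trees(mtcars, "mpg")
train_mse <- mean((mtcars$mpg - f_hat)^2)
print(train_mse)
```

The three tuning parameters are the number of trees \(B\), the shrinkage \(\lambda\) and the tree size \(d\). The training error shown here keeps decreasing with \(B\), which is why \(B\) should be chosen by cross-validation.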

Questions/Problems:

Exam in 2018: Problem 4
Compulsory 2 in 2019: Problem 1 and 3.
Compulsory 3 in 2018: Problem 1.
What does it mean that a method is greedy? Mention one greedy method that we have studied and explain why it is greedy.
How do we choose which split to perform in a tree? What is the natural cost function for regression? For classification we focus on node impurity - explain one possible cost function for node impurity.
Image of tree, explain what you see. Predict the value for a new observation with numerical value given.
Show full tree and pruned tree and results on test set: compare and argue for which of the models to choose.

How do we choose the number of bootstrap samples \(B\) to be used in bagging and random forest? What about boosting?
Why do we not have to use cross-validation to estimate error rates for bagging and random forest? What do we instead use, and how do we estimate error rates?
What is bootstrapping? We have looked at bootstrapping for finding the standard error of an estimator and for bagging and random forest. What is the main idea behind bagging? What is the connection between bagging and random forests?
For regression trees - what is a simple way to perform boosting?

9. Support vector machines

and solutions to RecEx.

SVM is a method for both classification and regression, but we have only studied two-class classification (classes are coded \(-1\) and \(1\)).
Aim: find a high dimensional hyperplane that separates two classes \(f({\bf x})=\beta_0+{\bf x}^T \boldsymbol\beta=0\). If \(y_if({\bf x}_i)>0\) observation \({\bf x}_i\) is correctly classified.
Central: maximizing the distance (on both sides) from the class boundary to the closest observations, the margin \(M\) (maximal margin classifier). This is relaxed with slack variables (support vector classifier), and extended to nonlinear functions of \({\bf x}\) by replacing the inner product with a kernel (support vector machine).
Support vectors: observations that lie on the margin or on the wrong side of the margin.

Kernels: generalization of an inner product to allow for non-linear boundaries and to speed up calculations, since the inner products only involve support vectors. Most popular kernel is the radial kernel \(K({\bf x}_i,{\bf x}_{i'})=\exp(-\gamma\sum_{j=1}^p (x_{ij}-x_{i'j})^2)\).
Tuning parameters: cost and parameters in kernels - chosen by CV.
Sad: not able to present details since then a course in optimization is needed.
Nice connection to a non-linear and ridge-penalized version of logistic regression - comparing hinge loss to logistic loss - but then without the computational advantages of the kernel method.
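
The radial kernel above is easy to compute directly. The sketch below uses hypothetical function names and three made-up points, and only evaluates the kernel (it does not fit a support vector machine).

```r
# Radial kernel between two observations x and z (numeric vectors)
radial_kernel <- function(x, z, gamma) exp(-gamma * sum((x - z)^2))

# Kernel matrix for all pairs of rows in X
kernel_matrix <- function(X, gamma) {
  n <- nrow(X)
  K <- matrix(0, n, n)
  for (i in 1:n) {
    for (k in 1:n) {
      K[i, k] <- radial_kernel(X[i, ], X[k, ], gamma)
    }
  }
  K
}

X <- rbind(c(0, 0), c(1, 1), c(3, 0))
K <- kernel_matrix(X, gamma = 0.5)
print(round(K, 4))
```

The kernel equals 1 for identical points and decays towards 0 with distance, faster for larger \(\gamma\). This is why a large \(\gamma\) gives a more local and more flexible class boundary.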

Questions/Problems:

Compulsory 2 in 2019: Problem 3
Compulsory 3 in 2018: Problem 2b
What is a support vector?
What are differences between a maximal margin classifier and linear discriminant analysis classifier?
What are the main differences between the maximal margin classifier and the support vector classifier? Explain the concept of a slack variable.
What are important aspects of the support vector machine?

10. Unsupervised learning: 6 files

Lecture 1 with Lab1 and New York times stories.
Lecture 2 with Lab2 and Lab3
and solutions to RecEx

Topics in Module 10

Principal component analysis:
- mathematical details (eigenvectors corresponding to covariance or correlation matrix) also in TMA4267.
- understanding loadings and scores and a biplot, choosing the number of principal components from proportion of variance explained or scree-type plots (elbow)
Clustering:
- \(k\)-means: number of clusters given, iterative algorithm to classify to nearest centroid and recalculate centroid
- hierarchical clustering: choice of distance measure, choice of linkage method (single, average, complete).
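
The methods listed above are all available in base R. The sketch below runs them on the USArrests data, which together with the choice of three clusters is an assumption made for illustration.

```r
X <- scale(USArrests)  # center and scale, since the variables have different units

# Principal component analysis: loadings, scores, proportion of variance explained
pca <- prcomp(X)
loadings <- pca$rotation
scores <- pca$x
pve <- pca$sdev^2 / sum(pca$sdev^2)
print(round(pve, 3))

# K-means with 3 clusters; nstart repeats the algorithm from several random starts
set.seed(1)
km <- kmeans(X, centers = 3, nstart = 20)

# Hierarchical clustering: Euclidean distance, complete linkage, cut into 3 groups
hc <- hclust(dist(X), method = "complete")
groups <- cutree(hc, k = 3)

# Cross-tabulate the two clusterings (cluster labels are arbitrary)
print(table(kmeans = km$cluster, hclust = groups))
```

Calling biplot(pca) and plot(hc) gives the biplot and the dendrogram. A scree plot is a plot of pve against the component number.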

PCA for quality control

Hierarchical clustering for visualization

Questions/Problems:

Exam in 2018: Problem 5
Compulsory 2 in 2019: Problem 2
Compulsory 3 in 2018: Problem 3
Principal component analysis is both used as an unsupervised method and in a supervised regression setting. Explain briefly how we define the principal components (loadings and scores) and how the principal components are used in the two settings.
Could also have small numerical task to show that you have understood how to construct a dendrogram, as in Problem 5 of the 2018 exam.

11. Neural networks

and solutions to RecEx

Topics in Module 11:

Feedforward network architecture: mathematical formula - layers of multivariate transformed (relu, linear, sigmoid) inner products - sequentially connected.
What is the number of parameters that need to be estimated? Intercept term (for each layer) is possible and is referred to as “bias term”.
Loss function to minimize (on output layer): regression (mean squared), classification binary (binary crossentropy), classification multiple classes (categorical crossentropy) - and remember to connect to the correct choice of output activation function: mean squared loss goes with linear activation, binary crossentropy with sigmoid, categorical crossentropy with softmax.
How to minimize the loss function: gradient based (chain rule) back-propagation - many variants.
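
Counting parameters and matching output activations can be sketched as below. The function names are hypothetical, and a fully connected feedforward network with a bias term in every layer is assumed.

```r
# units = number of nodes per layer, from input to output, e.g. c(5, 4, 1).
# Each layer contributes (inputs + 1 bias) * outputs weights.
count_params <- function(units, bias = TRUE) {
  n_in <- head(units, -1)
  n_out <- tail(units, -1)
  sum((n_in + as.integer(bias)) * n_out)
}

# Output activations that go with binary and categorical crossentropy
sigmoid <- function(z) 1 / (1 + exp(-z))
softmax <- function(z) {
  e <- exp(z - max(z))  # subtract the maximum for numerical stability
  e / sum(e)
}

# 5 inputs, one hidden layer with 4 nodes, 1 output:
# (5 + 1) * 4 + (4 + 1) * 1 = 29 parameters
print(count_params(c(5, 4, 1)))
print(softmax(c(1, 2, 3)))
```

The sigmoid returns one probability for the two-class problem, while the softmax returns one probability per class and these sum to one.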

Technicalities: nnet in R
Optional (not on reading list): keras in R. Use of tensors. Piping sequential layers, piping to estimation and then to evaluation (metrics).

Questions/Problems:

Exam in 2018: Problem 4, Q17
Compulsory 2 in 2019: Problem 3
Compulsory 3 in 2018: Problem 4
See RecEx Module 11 for short theoretical questions!

12. Summing-up (this module)

Questions/Problems - overall level

Make a graph with “horizontal axis: Model complexity” and “vertical axis: Interpretability” and position the classification methods we have covered in this course in the graph.
For many of the methods we have studied, the models are fitted minimizing a sum of a loss function and a tuning parameter times a penalty. Choose one method from regression and one from classification and explain what is the loss function and what is the penalty. Explain what the goal of the penalty is, and how the tuning parameter can be chosen.

This is Figure 2.7 from James et al. (2013). Explain what is the message of this figure.

Exam and exam preparation

Supervision before the exam - dates will be decided - tentatively 2hrs in the two weeks starting with May 6 and May 13, and then 2 hrs on May 20 or 21 and on May 22.
and maybe use the Discussion forum on Bb?

Digital exam

Our exam is a digital exam in Inspera Assessment.

https://innsida.ntnu.no/wiki/-/wiki/English/Digital+exam+for+students

https://innsida.ntnu.no/wiki/-/wiki/English/Digital+school+exam+-+for+students

Important:

You need to have the latest version of Safe Exam Browser (SEB) installed.
Go through the test exam to check that your installation is working.
Linux, Android, iOS or ChromeOS are not supported: fill in the form to ask for a loan PC if needed.

Nice: your exam paper is available to you after the exam.

The SEB will make sure that you may only access Inspera - and not any other programmes on your machine or on Internet. This means that R and R Studio will not be available to you at the exam.

In class: we looked at a digital exam held in TMA4315 Generalized linear models, to see how you may choose to write with the computer - or on a sheet of paper (“skoleskisser”).

The planned exam set-up

We have 30% on the compulsory exercises, and 70% on the written exam. These 70% are 70 points on the written exam.

Problem types

Since we have a digital exam, there is a possibility to not only use the “ordinary type of written exam”. This means:

For a question of type: “what is the interpretation of a receiver operating characteristic (ROC) curve?” you may write parts of the answer in an “essay box” and parts on paper (if you want to draw or write equations).
Build a sentence by choosing from drop-down menus.
Match equations and figures.
Four statements, choose the correct statement.
Calculation task, write the numerical answer for automatic correction.

Topic breakdown

Regression - 20-50 points
Classification - 20-50 points
Unsupervised learning - 0-20 points
inherently: overfitting and bias-variance trade-off, train/validate/test and cross-validation, assumptions and reasoning behind models and methods, interpretation of results.

Regression problem

Explain about a data set and show print-out and residual plots from fitting a multiple linear regression model:
- interpret, write down formulas, assess model fit.

For example with the Framingham data set from the 2018 Compulsory exercise 1: Problem 2.

## 
## Call:
## lm(formula = -1/sqrt(SYSBP) ~ ., data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0207366 -0.0039157 -0.0000304  0.0038293  0.0189747 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.103e-01  1.383e-03 -79.745  < 2e-16 ***
## SEX         -2.989e-04  2.390e-04  -1.251 0.211176    
## AGE          2.378e-04  1.434e-05  16.586  < 2e-16 ***
## CURSMOKE    -2.504e-04  2.527e-04  -0.991 0.321723    
## BMI          3.087e-04  2.955e-05  10.447  < 2e-16 ***
## TOTCHOL      9.288e-06  2.602e-06   3.569 0.000365 ***
## BPMEDS       5.469e-03  3.265e-04  16.748  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005819 on 2593 degrees of freedom
## Multiple R-squared:  0.2494, Adjusted R-squared:  0.2476 
## F-statistic: 143.6 on 6 and 2593 DF,  p-value: < 2.2e-16

Expand into investigating non-linearities of one or more covariates, present print-out and plots
- explain, interpret, write down formulas, maybe compare linear vs. non-linear fit

library(gam)
m2 = gam(-1/sqrt(SYSBP) ~ SEX + s(AGE) + CURSMOKE + s(BMI) + s(TOTCHOL) + BPMEDS, data = data)
par(mfrow = c(2, 3))
plot(m2)

Fitting a penalized solution might improve on prediction accuracy. Explain what is done below and what is the suggested regression model. Write down the function that is minimized to fit this model, and explain how the tuning parameter is chosen.

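library(glmnet)
# design matrix without the intercept column (glmnet adds its own intercept)
x <- model.matrix(modelA)[, -1]
y <- -1/data$SYSBP
# lasso is the glmnet default (alpha = 1); fit the whole path of lambda values
fit.lasso = glmnet(x, y)
plot(fit.lasso, xvar = "lambda", label = TRUE)

# 10-fold cross-validation over lambda; report the coefficients at the
# largest lambda within one standard error of the minimum CV error
cv.lasso = cv.glmnet(x, y, nfolds = 10)
coef(cv.lasso, s = "lambda.1se")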

## 7 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -1.034555e-02
## SEX          .           
## AGE          3.258758e-05
## CURSMOKE     .           
## BMI          3.553268e-05
## TOTCHOL      .           
## BPMEDS       7.413055e-04

Then move on to a tree, and look at full (and possibly pruned) tree. Prediction and interpretation. Theoretical questions on the fitting. Predict value for a new observation (numerically): given that a patient has BPMEDS=1, is 30 years of age and has a BMI of 27, what is the predicted value for the -1/sqrt(SYSBP)?

library(tree)
m3 = tree(-1/sqrt(SYSBP) ~ ., data = data)
plot(m3)
text(m3)

Then a test set should mysteriously appear to be part of a testing regime and you evaluate the results and compare them.

Classification problem

Explain about a problem with 2 or more classes, for example the South African heart disease data set (from module 4). Training and test set.

Write down the fitted model. The estimated coefficient for famhist is \(1.047\). How can do explain the effect famhist? How would you evaluate the fit of this model?

## 
## Call:
## glm(formula = chd ~ tobacco + famhist + typea + obesity + age, 
##     family = "binomial", data = train_SA)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1165  -0.8491  -0.4142   0.9481   2.2283  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -6.452432   1.535920  -4.201 2.66e-05 ***
## tobacco         0.078834   0.035911   2.195  0.02815 *  
## famhistPresent  1.047384   0.323689   3.236  0.00121 ** 
## typea           0.044812   0.017458   2.567  0.01026 *  
## obesity        -0.003855   0.036471  -0.106  0.91581    
## age             0.060659   0.014736   4.116 3.85e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 301.69  on 230  degrees of freedom
## Residual deviance: 239.87  on 225  degrees of freedom
## AIC: 251.87
## 
## Number of Fisher Scoring iterations: 5
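
One way to interpret the famhist coefficient is through the odds ratio. The sketch below copies the estimates from the print-out above; the covariate values of the example person are made up for illustration:

```r
# Coefficient estimates copied from the glm print-out above
beta <- c(intercept = -6.452432, tobacco = 0.078834, famhist = 1.047384,
          typea = 0.044812, obesity = -0.003855, age = 0.060659)

# Odds ratio for famhist = Present versus Absent, other covariates held fixed
or_famhist <- exp(beta["famhist"])
round(or_famhist, 2)  # 2.85

# Hypothetical person (made-up covariate values), without and with family history
x_absent <- c(intercept = 1, tobacco = 5, famhist = 0, typea = 50,
              obesity = 25, age = 50)
x_present <- replace(x_absent, "famhist", 1)

# Predicted probabilities of chd via the inverse logit
p_absent <- plogis(sum(beta * x_absent))
p_present <- plogis(sum(beta * x_present))

# The ratio of the two odds reproduces exp(beta_famhist)
(p_present / (1 - p_present)) / (p_absent / (1 - p_absent))
```

So the odds of chd are multiplied by about 2.85 when family history is present, whatever the values of the other covariates. The change in probability, in contrast, depends on those values.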

To evaluate the model fit on a new test set an ROC curve is made. How is this curve constructed? Explain what you see and evaluate the goodness of the model.
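
The construction of an ROC curve can be sketched by hand in base R. The labels and predicted probabilities below are simulated stand-ins for the test set (an assumption), and the names y_test and p_hat are hypothetical:

```r
# Simulated stand-ins for test-set labels and predicted probabilities
set.seed(1)
y_test <- rbinom(200, 1, 0.35)
p_hat <- plogis(-1 + 1.5 * y_test + rnorm(200))

# Sweep the cut-off from "classify nobody as diseased" to "classify everybody"
thresholds <- c(Inf, sort(unique(p_hat), decreasing = TRUE))

# Sensitivity: proportion of true cases with p_hat at or above the cut-off
sens <- sapply(thresholds, function(t) mean(p_hat[y_test == 1] >= t))
# 1 - specificity: proportion of non-cases with p_hat at or above the cut-off
fpr <- sapply(thresholds, function(t) mean(p_hat[y_test == 0] >= t))

# Area under the curve by the trapezoidal rule
auc <- sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)
auc

plot(fpr, sens, type = "s", xlab = "1 - specificity", ylab = "Sensitivity")
abline(0, 1, lty = 2)  # reference line for random guessing
```

Each point on the curve corresponds to one cut-off on the predicted probability. A curve close to the upper left corner, with AUC near 1, indicates a good classifier, while the diagonal corresponds to random guessing.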

Another classification method is linear discriminant analysis (LDA). Would LDA be a suitable method for this data set? What are the assumptions the LDA classifier is based on?
Tree and pruned tree also possible here - and bagging and random forests - interpretation and theoretical questions.
If you were to fit a feedforward neural network to this data set (with the covariates listed above), suggest a possible network architecture. What would be a sensible loss function?
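
One possible answer to the neural network question, given as a sketch and not as the only sensible choice: five input nodes (one per covariate, with famhist coded 0/1), a single hidden layer with a handful of nodes (the number is a tuning choice and an assumption here), and one output node with sigmoid activation, so that the output \(\hat{y}_i\) estimates \(P(\text{chd}_i = 1)\). A natural loss function is then the binary cross-entropy, written here with \(\theta\) for all weights of the network:

\[ J(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \Big[ y_i \ln \hat{y}_i(\theta) + (1 - y_i) \ln\big(1 - \hat{y}_i(\theta)\big) \Big] \]

Without the hidden layer this network reduces to the logistic regression model fitted above.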

Unsupervised learning problem

Similar to the two tasks in the 2018 Compulsory 3, Problem 3a and 3b, where the task is to comment on the output and recognize the method used.
There could also be a small calculation task to show that you have understood the concepts, like Problem 5 on the 2018 exam.
Consider the following four observations of a two-dimensional random vector \({\bf X}=\begin{pmatrix} X_1 \\ X_2\end{pmatrix}\) \[ {\bf a}=\begin{pmatrix}5\\4 \end{pmatrix}, {\bf b}=\begin{pmatrix}1\\-1 \end{pmatrix}, {\bf c}=\begin{pmatrix}-1\\1 \end{pmatrix}, {\bf d}=\begin{pmatrix}4\\0 \end{pmatrix}\] Calculate the matrix of pairwise Euclidean distances between the points. Use hierarchical clustering with single, complete and average linkage to cluster the points. Draw dendrograms. Assume that we want two clusters. Which two groups will the dendrograms then give? [M10, exam TMA4270 1995 3a] [Solutions to that exam (scroll to 3a)](https://www.math.ntnu.no/TMA4268/2019v/Exam/75554Des1995LF.pdf)
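
The hand calculation can be checked in base R. This is a sketch for verification only (the object names X, D and groups are chosen here); on the exam the distances and dendrograms are worked out by hand:

```r
# The four observations as rows of a matrix
X <- rbind(a = c(5, 4), b = c(1, -1), c = c(-1, 1), d = c(4, 0))

# Pairwise Euclidean distances (the default in dist)
D <- dist(X)
round(D, 3)

# Dendrograms and two-cluster solutions for the three linkages
linkages <- c("single", "complete", "average")
par(mfrow = c(1, 3))
for (m in linkages) plot(hclust(D, method = m), main = m)

# Cluster membership when each dendrogram is cut into two groups
groups <- sapply(linkages, function(m) cutree(hclust(D, method = m), k = 2))
groups
```

Single linkage gives the groups {a} and {b, c, d}, while complete and average linkage both give {a, d} and {b, c}.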

After TMA4268 - what is next?

What are the statistical challenges we have not covered?

Do you want to learn more about the methods we have looked at in this course? And also methods that are more tailored towards specific types of data? Then we have many statistics courses that you may choose from.

An overview of the statistics courses, and information on the statistics staff (for bachelor and master supervision): https://folk.ntnu.no/mettela/Talks/3klinfo20190325.html

On behalf of the teaching staff (Michail, Andreas, Thiago and Mette): thank you for attending this course. We hope to see you for the exam supervision, and good luck on May 23!

References

Efron, Bradley, and Trevor Hastie. 2016. Computer Age Statistical Inference - Algorithms, Evidence, and Data Science. Cambridge University Press.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York: Springer.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Vol. 112. Springer.

TMA4268 Statistical Learning V2019

Module 12: SUMMING UP

Mette Langaas, Department of Mathematical Sciences, NTNU

week 15
