TMA4268 Statistical Learning V2018

MODULE 1: INTRODUCTION TO STATISTICAL LEARNING

Mette Langaas, Department of Mathematical Sciences, NTNU – with contributions from Julia Debik

Week 2 2018 (Version 11.2017)

Aim of the module

  • overview of the course
  • learning outcome
  • new words: statistical learning, data mining, data science
  • the role of R
  • short presentation of course modules
  • practical details of the course (Blackboard)
  • the basics of R and Rstudio
  • key concepts from your first course in statistics – that you will need now, mixed with notation for this course - added 2-3 important data sets?
  • introductory course on R (performing the exam? for TMA4240H2017 or TMA4245V2017? in R?)

Learning material for this module: our textbook, James et al. (2013): An Introduction to Statistical Learning, Chapter 1. Links to R resources are given at the end of this page.

Who is this course for?

  • Bachelor level: 3rd-year students from Science or Technology programmes.
  • Statistics background: TMA4240/45 Statistics or equivalent.
  • No background in statistical software is needed, but we will use the R statistical software extensively in the course.
  • Not a prerequisite, but a good thing: some knowledge of computing - preferably an introductory course in informatics, like TDT4105 or TDT4110.
  • There is some overlap with TDT Data mining and case-based reasoning - but the courses differ in philosophy (computer science vs. statistics).

Important: the course focuses on statistical theory, but all models and methods on the reading list will also be investigated using (mostly) available functions in R and real data sets. It is important that the student, at the end of the course, can analyse all types of data (covered in the course) - not just understand the theory. And vice versa - this is not a “we learn how to perform data analysis”-course - the student must also understand the models, methods and algorithms used. There is a final written exam (70% of the final grade) in addition to compulsory exercises (30% of the final grade).

Course content

Statistical learning, multiple linear regression, classification, resampling methods, model selection/regularization, non-linearity, support vector machines, tree-based methods, unsupervised methods, neural nets.

Learning outcome

  1. Knowledge. The student has knowledge about the most popular statistical learning models and methods that are used for prediction in science and technology. Emphasis is on regression- and classification-type statistical models.

  2. Skills. The student knows, based on an existing data set, how to choose a suitable statistical model, apply sound statistical methods, and perform the analyses using statistical software. The student knows how to present the results from the statistical analyses, and which conclusions can be drawn from the analyses.

Learning methods and activities

  • Lectures, exercises and works (projects).
  • Portfolio assessment is the basis for the grade awarded in the course. This portfolio comprises a written final examination (70%) and works (projects) (30%). The results for the constituent parts are to be given in %-points, while the grade for the whole portfolio (course grade) is given by the letter grading system. Retake of examination may be given as an oral examination. The lectures may be given in English.

Textbook: James, Witten, Hastie, Tibshirani (2013): “An Introduction to Statistical Learning”.

Tentative reading list: the whole book + the module pages + three compulsory exercises.

Course team

Lecturers:

  • Mette Langaas (IMF/NTNU)
  • Thiago G. Martins (AIAscience and IMF/NTNU)

Teaching assistants (TAs)

  • Thea Roksvåg (head TA)
  • Andreas Strand
  • Martina Hall

If needed, also student assistants.

New words

Statistical learning

Data mining

Data science

The modules - with tentative plan

  1. Introduction (the module page you are reading now) [Ch 1] 2018-w2

  2. Statistical Learning [Ch 2] 2018-w3

  3. Linear Regression [Ch 3] 2018-w4

  4. Classification [Ch 4] 2018-w5

Compulsory exercise 1

  5. Resampling methods [Ch 5] 2018-w7

  6. Linear Model Selection and Regularization [Ch 6] 2018-w8

  7. Moving Beyond Linearity [Ch 7] 2018-w9

Compulsory exercise 2

  8. Tree-Based Methods [Ch 8] 2018-w11

  9. Support Vector Machines [Ch 9] 2018-w12

Easter break 2018-w13

  10. Unsupervised learning [Ch 10] 2018-w15

  11. Neural Networks 2018-w16

  12. Summing up and exam preparation 2018-w17

Compulsory exercise 3


Module 2: Statistical learning

Module 3: Multiple linear regression

Example: Exam TMA4267, V2017, Problem 2: CVD

The Framingham Heart Study is a study of the etiology (i.e. underlying causes) of cardiovascular disease (CVD), with participants from the community of Framingham in Massachusetts, USA (https://www.framinghamheartstudy.org/). This data set is a subset of a teaching version of the Framingham data, used with permission from the Framingham Heart Study.


We will focus on modelling systolic blood pressure using data from \(n=2600\) persons. For each person in the data set we have measurements of the following seven variables:

  • SYSBP systolic blood pressure (mmHg),

  • SEX 1=male, 2=female,

  • AGE age (years) at examination,

  • CURSMOKE current cigarette smoking at examination: 0=not current smoker, 1= current smoker,

  • BMI body mass index (kg/m\(^2\)),

  • TOTCHOL serum total cholesterol (mg/dl), and

  • BPMEDS use of anti-hypertensive medication at examination: 0=not currently using, 1=currently using.


A multiple normal linear regression model was fitted to the data set with \(-\frac{1}{\sqrt{SYSBP}}\) as response and all the other variables as covariates.

The data set is here called thisds.

modelB=lm(-1/sqrt(SYSBP)~SEX+AGE+CURSMOKE+BMI+TOTCHOL+BPMEDS,data=thisds)
summary(modelB)
## 
## Call:
## lm(formula = -1/sqrt(SYSBP) ~ SEX + AGE + CURSMOKE + BMI + TOTCHOL + 
##     BPMEDS, data = thisds)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0207366 -0.0039157 -0.0000304  0.0038293  0.0189747 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.103e-01  1.383e-03 -79.745  < 2e-16 ***
## SEX         -2.989e-04  2.390e-04  -1.251 0.211176    
## AGE          2.378e-04  1.434e-05  16.586  < 2e-16 ***
## CURSMOKE    -2.504e-04  2.527e-04  -0.991 0.321723    
## BMI          3.087e-04  2.955e-05  10.447  < 2e-16 ***
## TOTCHOL      9.288e-06  2.602e-06   3.569 0.000365 ***
## BPMEDS       5.469e-03  3.265e-04  16.748  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005819 on 2593 degrees of freedom
## Multiple R-squared:  0.2494, Adjusted R-squared:  0.2476 
## F-statistic: 143.6 on 6 and 2593 DF,  p-value: < 2.2e-16

Example: The SLID data set

The SLID data set is available in the car package. It contains data from the 1994 wave of the Canadian Survey of Labour and Income Dynamics, from the province of Ontario. The data set has 7425 rows and 5 columns; however, it contains many missing values.


We will here model the composite hourly wage rate from all jobs, wages, as a function of four variables:

  • education: Number of years of schooling.
  • age : Age in years.
  • sex : A categorical variable with two classes: {Male, Female}.
  • language: A categorical variable with three classes: {English, French, Other}.

A scatterplot can let us take a first glance at the data set.

library(car)
library(ggplot2)
library(ggpubr)
p1 = ggplot(SLID, aes(education, wages)) + geom_point(size=0.5) 
p2 = ggplot(SLID, aes(age, wages)) + geom_point(size=0.5)
p3 = ggplot(SLID, aes(sex, wages)) + geom_point(size=0.5)
p4 = ggplot(SLID, aes(language, wages)) + geom_point(size=0.5)
ggarrange(p1, p2, p3, p4)

We see that there is a relationship between the response variable wages and the variables education, sex, age and language.

A multiple normal linear regression model was fitted to the data set with wages as response and all the other variables as covariates.

lm_wages = lm(wages ~ ., data = SLID)
summary(lm_wages)
## 
## Call:
## lm(formula = wages ~ ., data = SLID)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.062  -4.347  -0.797   3.237  35.908 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -7.888779   0.612263 -12.885   <2e-16 ***
## education       0.916614   0.034762  26.368   <2e-16 ***
## age             0.255137   0.008714  29.278   <2e-16 ***
## sexMale         3.455411   0.209195  16.518   <2e-16 ***
## languageFrench -0.015223   0.426732  -0.036    0.972    
## languageOther   0.142605   0.325058   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.6 on 3981 degrees of freedom
##   (3438 observations deleted due to missingness)
## Multiple R-squared:  0.2973, Adjusted R-squared:  0.2964 
## F-statistic: 336.8 on 5 and 3981 DF,  p-value: < 2.2e-16

Module 4: Classification

Classification is the approach we take for predicting qualitative responses (recall that a qualitative response is a response that takes values in one of a number of distinct classes). In other words: given a new observation, we want to assign the observation to a class. For that we need a training set containing observations whose class membership is known, and a classification rule. We will in this module discuss the classifiers logistic regression, linear and quadratic discriminant analysis, and K-nearest neighbors.

Example: Spam filtering

Spam filtering is a binary classification task, where the aim is to distinguish between two types of e-mails: “spam” and “non-spam”. This classification can be made according to the probabilities of words occurring in spam e-mail and in legitimate e-mail. The classification can thus be based on the frequencies of words and characters in the e-mail. The variables in the spam data set include:


  • word_freq_WORD : Percentage of words in the e-mail that match WORD
  • char_freq_CHAR : Percentage of characters in the e-mail that match CHAR
  • capital_run_length_average : Average length of uninterrupted sequence of capital letters
  • capital_run_length_longest : Length of longest uninterrupted sequence of capital letters
  • capital_run_length_total : Total number of capital letters in the e-mail

A logistic regression model has been fitted to the spam data set, with all variables as covariates, and with the response being the factor variable y (spam: y=1, non-spam: y=0).

glm_spam = glm(y~., data=spam_dataset, family="binomial")
summary(glm_spam)
## 
## Call:
## glm(formula = y ~ ., family = "binomial", data = spam_dataset)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -4.127  -0.203   0.000   0.114   5.364  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -1.569e+00  1.420e-01 -11.044  < 2e-16 ***
## word_freq_make             -3.895e-01  2.315e-01  -1.683 0.092388 .  
## word_freq_address          -1.458e-01  6.928e-02  -2.104 0.035362 *  
## word_freq_all               1.141e-01  1.103e-01   1.035 0.300759    
## word_freq_3d                2.252e+00  1.507e+00   1.494 0.135168    
## word_freq_our               5.624e-01  1.018e-01   5.524 3.31e-08 ***
## word_freq_over              8.830e-01  2.498e-01   3.534 0.000409 ***
## word_freq_remove            2.279e+00  3.328e-01   6.846 7.57e-12 ***
## word_freq_internet          5.696e-01  1.682e-01   3.387 0.000707 ***
## word_freq_order             7.343e-01  2.849e-01   2.577 0.009958 ** 
## word_freq_mail              1.275e-01  7.262e-02   1.755 0.079230 .  
## word_freq_receive          -2.557e-01  2.979e-01  -0.858 0.390655    
## word_freq_will             -1.383e-01  7.405e-02  -1.868 0.061773 .  
## word_freq_people           -7.961e-02  2.303e-01  -0.346 0.729557    
## word_freq_report            1.447e-01  1.364e-01   1.061 0.288855    
## word_freq_addresses         1.236e+00  7.254e-01   1.704 0.088370 .  
## word_freq_free              1.039e+00  1.457e-01   7.128 1.01e-12 ***
## word_freq_business          9.599e-01  2.251e-01   4.264 2.01e-05 ***
## word_freq_email             1.203e-01  1.172e-01   1.027 0.304533    
## word_freq_you               8.131e-02  3.505e-02   2.320 0.020334 *  
## word_freq_credit            1.047e+00  5.383e-01   1.946 0.051675 .  
## word_freq_your              2.419e-01  5.243e-02   4.615 3.94e-06 ***
## word_freq_font              2.013e-01  1.627e-01   1.238 0.215838    
## word_freq_000               2.245e+00  4.714e-01   4.762 1.91e-06 ***
## word_freq_money             4.264e-01  1.621e-01   2.630 0.008535 ** 
## word_freq_hp               -1.920e+00  3.128e-01  -6.139 8.31e-10 ***
## word_freq_hpl              -1.040e+00  4.396e-01  -2.366 0.017966 *  
## word_freq_george           -1.177e+01  2.113e+00  -5.569 2.57e-08 ***
## word_freq_650               4.454e-01  1.991e-01   2.237 0.025255 *  
## word_freq_lab              -2.486e+00  1.502e+00  -1.656 0.097744 .  
## word_freq_labs             -3.299e-01  3.137e-01  -1.052 0.292972    
## word_freq_telnet           -1.702e-01  4.815e-01  -0.353 0.723742    
## word_freq_857               2.549e+00  3.283e+00   0.776 0.437566    
## word_freq_data             -7.383e-01  3.117e-01  -2.369 0.017842 *  
## word_freq_415               6.679e-01  1.601e+00   0.417 0.676490    
## word_freq_85               -2.055e+00  7.883e-01  -2.607 0.009124 ** 
## word_freq_technology        9.237e-01  3.091e-01   2.989 0.002803 ** 
## word_freq_1999              4.651e-02  1.754e-01   0.265 0.790819    
## word_freq_parts            -5.968e-01  4.232e-01  -1.410 0.158473    
## word_freq_pm               -8.650e-01  3.828e-01  -2.260 0.023844 *  
## word_freq_direct           -3.046e-01  3.636e-01  -0.838 0.402215    
## word_freq_cs               -4.505e+01  2.660e+01  -1.694 0.090333 .  
## word_freq_meeting          -2.689e+00  8.384e-01  -3.207 0.001342 ** 
## word_freq_original         -1.247e+00  8.064e-01  -1.547 0.121978    
## word_freq_project          -1.573e+00  5.292e-01  -2.973 0.002953 ** 
## word_freq_re               -7.923e-01  1.556e-01  -5.091 3.56e-07 ***
## word_freq_edu              -1.459e+00  2.686e-01  -5.434 5.52e-08 ***
## word_freq_table            -2.326e+00  1.659e+00  -1.402 0.160958    
## word_freq_conference       -4.016e+00  1.611e+00  -2.493 0.012672 *  
## `char_freq_;`              -1.291e+00  4.422e-01  -2.920 0.003503 ** 
## `char_freq_(`              -1.881e-01  2.494e-01  -0.754 0.450663    
## `char_freq_[`              -6.574e-01  8.383e-01  -0.784 0.432914    
## `char_freq_!`               3.472e-01  8.926e-02   3.890 0.000100 ***
## `char_freq_$`               5.336e+00  7.064e-01   7.553 4.24e-14 ***
## `char_freq_#`               2.403e+00  1.113e+00   2.159 0.030883 *  
## capital_run_length_average  1.199e-02  1.884e-02   0.636 0.524509    
## capital_run_length_longest  9.118e-03  2.521e-03   3.618 0.000297 ***
## capital_run_length_total    8.437e-04  2.251e-04   3.747 0.000179 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6170.2  on 4600  degrees of freedom
## Residual deviance: 1815.8  on 4543  degrees of freedom
## AIC: 1931.8
## 
## Number of Fisher Scoring iterations: 13

We can for example see that a high occurrence of the words free, business or your might indicate that the e-mail is spam. The same is true for a high occurrence of the characters ! and $, and for a long uninterrupted sequence of capital letters. On the other hand, a high frequency of the words report, labs or telnet does not indicate spam.

Example: Classification of Iris plants

Image taken from: http://blog.kaggle.com/2015/04/22/scikit-learn-video-3-machine-learning-first-steps-with-the-iris-dataset/

The iris flower data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in 1936. The data set contains three plant species {setosa, virginica, versicolor} and four features measured for each corresponding sample: Sepal.Length, Sepal.Width, Petal.Length and Petal.Width.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

In this case the aim would be to correctly classify the species of an iris plant, based on measurements of its sepal length and sepal width. This model will give linear boundaries between the groups.


A competing method is linear discriminant analysis (LDA). This method (also) provides a way in which linear boundaries between groups can be created.
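A minimal sketch of such an LDA fit on the iris data, using the two sepal measurements and the lda function from the MASS package (assumed installed; the details are covered in Module 4):

```r
# Sketch: LDA for the iris species using the two sepal measurements.
library(MASS)
lda_iris = lda(Species ~ Sepal.Length + Sepal.Width, data = iris)
lda_pred = predict(lda_iris)$class
table(lda_pred, iris$Species)  # confusion matrix: rows = predicted, columns = true
```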

In this plot the small black dots represent correctly classified iris plants, while the red dots represent misclassifications. The big black dots represent the class means.


Sometimes a more suitable boundary is not linear. Quadratic discriminant analysis (QDA) provides a way in which quadratic boundaries between groups can be created.
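A QDA fit follows the same pattern (a sketch, again assuming the MASS package):

```r
# Sketch: QDA for the iris species using the two sepal measurements.
library(MASS)
qda_iris = qda(Species ~ Sepal.Length + Sepal.Width, data = iris)
mean(predict(qda_iris)$class != iris$Species)  # training misclassification rate
```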

Module 5: Resampling Methods

Two of the most commonly used resampling methods are cross-validation and the bootstrap. Cross-validation is often used to choose appropriate values for tuning parameters. The bootstrap is often used to provide a measure of the accuracy of a parameter estimate.

Cross-validation and the choice of neighbours in KNN

A \(K\)-nearest neighbors (KNN) classifier classifies a test observation \(x_0\) by identifying the \(K\) points in the training data that are closest to \(x_0\). The predicted class for the test observation is then found by a majority vote among these neighbors: \(x_0\) is assigned to the class which is most common among them. When using the KNN algorithm, the choice of \(K\) is crucial for obtaining an accurate prediction. Cross-validation is a great tool for deciding on the best value of \(K\).


In \(K\)-fold cross-validation the original data set is randomly partitioned into \(K\) parts of equal size. \(K-1\) of the parts make up the training set, and are used to train a statistical method. The predictive power is then tested on the left-out part, which is the test set. This procedure is repeated \(K\) times, with a new part of the original data set left out as a test set, each time.
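The partitioning step can be sketched in R as follows (an illustration only; the fold assignment is random, and the model fitting inside the loop is left as a placeholder):

```r
# Sketch: randomly assigning the n observations of a data set to K = 5 folds.
set.seed(1)
K = 5
n = nrow(iris)
folds = sample(rep(1:K, length.out = n))  # fold label for each observation
for (k in 1:K) {
  test = iris[folds == k, ]    # the left-out part: the test set
  train = iris[folds != k, ]   # the remaining K-1 parts: the training set
  # ... fit the statistical method on train, evaluate it on test ...
}
```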


To use the KNN classifier on the iris data set, we need to decide on the number of neighbors \(K\). By trying all values of \(K\) from 1 to 50, and comparing the misclassification error rates estimated by leave-one-out cross-validation, an optimal value of \(K\) can be chosen. We look at the procedure in more detail for the choice \(K=5\). We fit a KNN classifier where the training set consists of all but the last column of the iris data set. The true class labels are in the last column of the data set and are passed to the knn.cv function via the cl argument. We print the confusion matrix to see the number of misclassifications, by calling the function table with the true and predicted classes.

library(class)
set.seed(100)

knn5 = knn.cv(train = iris[,-5], cl=iris[,5], k=5)
t = table(knn5, iris[,5])
t
##             
## knn5         setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         47         2
##   virginica       0          3        48
knn5_misclas = (150-sum(diag(t)))/150
knn5_misclas
## [1] 0.03333333

We see that the correct classifications are found on the diagonal: 50 + 47 + 48 = 145 observations out of 150 were correctly classified using our 5-nearest neighbor classifier. The misclassification error rate is 5/150 = 3.33%. We proceed by trying all values of \(K\) from 1 to 50:

misclas = numeric(50)
for (k in 1:50) {
  knn = knn.cv(train = iris[,-5], cl = iris[,5], k = k)
  t = table(knn, iris[,5])
  misclas[k] = (150 - sum(diag(t)))/150
}

knn_ds = data.frame( x = 1:50, y = misclas)
ggplot(knn_ds, aes(x = x, y=y)) + geom_point() + xlab("Number of neighbors k") + ylab("Misclassification error") + ggtitle("Error rate for classification of iris plants with varying k") + geom_line()

The plot suggests that \(K = 14\) might be a good choice.

Example: Estimating the accuracy of a linear regression model

Recall our multiple linear regression model fit to the SLID data set. By fitting a linear regression model using the lm function we obtained estimates for the coefficients and their corresponding standard errors. Another way to obtain estimates of the standard errors of the coefficients is the bootstrap. A bootstrap sample is a sample of the data set obtained by drawing observations with replacement; each bootstrap sample is therefore slightly different from the original data set. By fitting a separate linear regression model to each of the bootstrap samples, we can estimate the standard errors of the estimated coefficients. We demonstrate this on the SLID data set.

library(boot)
library(car)
set.seed(10)
boot.fn = function(data, index){
  return(coef(lm(wages~., data=data, subset=index)))
}
boot(SLID, boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = SLID, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##        original        bias    std. error
## t1* -7.88877921 -0.0232394336 0.675092870
## t2*  0.91661398  0.0008373061 0.039330311
## t3*  0.25513680  0.0002563369 0.008914467
## t4*  3.45541061  0.0063374727 0.210507559
## t5* -0.01522333  0.0236211853 0.441901231
## t6*  0.14260463 -0.0006454362 0.329202108

The standard errors of the 1000 bootstrap estimates of the coefficients in the linear model fit to the SLID data set are given in the std. error column. For example: the bootstrap estimate of SE(\(\hat{\beta}_0\)) is 0.67509.
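For comparison, the formula-based standard errors from the lm fit can be extracted directly and compared with the bootstrap estimates:

```r
# Standard errors from the least squares fit to the SLID data (car package).
library(car)
lm_wages = lm(wages ~ ., data = SLID)
summary(lm_wages)$coefficients[, "Std. Error"]
```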

Module 6: Linear Model Selection and Regularization

Recall the standard linear model \[Y = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \varepsilon,\] which describes the relationship between the response variable \(Y\) and a set of predictor variables \(X_1, X_2, ..., X_p\). We will in this module discuss how the linear model can be improved by replacing least squares fitting with alternative fitting procedures. The motivation for this is that alternative fitting procedures can yield better prediction accuracy and model interpretability.
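One such alternative, covered later in this module, is ridge regression, which minimizes a penalized version of the residual sum of squares: \[\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^p \beta_j^2,\] where the tuning parameter \(\lambda \ge 0\) controls the amount of shrinkage of the coefficients towards zero. The lasso uses the penalty \(\lambda \sum_{j=1}^p |\beta_j|\) instead, which can set some coefficients exactly to zero.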


Example: Subset selection

library(ISLR)
data(Hitters)
Hitters = na.omit(Hitters)
lm1 = lm(Salary~., data=Hitters)
summary(lm1)
## 
## Call:
## lm(formula = Salary ~ ., data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -907.62 -178.35  -31.11  139.09 1877.04 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  163.10359   90.77854   1.797 0.073622 .  
## AtBat         -1.97987    0.63398  -3.123 0.002008 ** 
## Hits           7.50077    2.37753   3.155 0.001808 ** 
## HmRun          4.33088    6.20145   0.698 0.485616    
## Runs          -2.37621    2.98076  -0.797 0.426122    
## RBI           -1.04496    2.60088  -0.402 0.688204    
## Walks          6.23129    1.82850   3.408 0.000766 ***
## Years         -3.48905   12.41219  -0.281 0.778874    
## CAtBat        -0.17134    0.13524  -1.267 0.206380    
## CHits          0.13399    0.67455   0.199 0.842713    
## CHmRun        -0.17286    1.61724  -0.107 0.914967    
## CRuns          1.45430    0.75046   1.938 0.053795 .  
## CRBI           0.80771    0.69262   1.166 0.244691    
## CWalks        -0.81157    0.32808  -2.474 0.014057 *  
## LeagueN       62.59942   79.26140   0.790 0.430424    
## DivisionW   -116.84925   40.36695  -2.895 0.004141 ** 
## PutOuts        0.28189    0.07744   3.640 0.000333 ***
## Assists        0.37107    0.22120   1.678 0.094723 .  
## Errors        -3.36076    4.39163  -0.765 0.444857    
## NewLeagueN   -24.76233   79.00263  -0.313 0.754218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 315.6 on 243 degrees of freedom
## Multiple R-squared:  0.5461, Adjusted R-squared:  0.5106 
## F-statistic: 15.39 on 19 and 243 DF,  p-value: < 2.2e-16
lm2 = update(lm1, .~.-CHmRun)
summary(lm2)
## 
## Call:
## lm(formula = Salary ~ AtBat + Hits + HmRun + Runs + RBI + Walks + 
##     Years + CAtBat + CHits + CRuns + CRBI + CWalks + League + 
##     Division + PutOuts + Assists + Errors + NewLeague, data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -914.51 -175.66  -31.72  137.49 1876.79 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  163.08380   90.59426   1.800 0.073071 .  
## AtBat         -1.97939    0.63268  -3.129 0.001970 ** 
## Hits           7.44499    2.31485   3.216 0.001475 ** 
## HmRun          4.03304    5.52892   0.729 0.466429    
## Runs          -2.27127    2.80872  -0.809 0.419504    
## RBI           -0.96237    2.47840  -0.388 0.698131    
## Walks          6.20550    1.80884   3.431 0.000707 ***
## Years         -3.42721   12.37355  -0.277 0.782031    
## CAtBat        -0.17461    0.13146  -1.328 0.185336    
## CHits          0.18359    0.48861   0.376 0.707440    
## CRuns          1.40160    0.56455   2.483 0.013714 *  
## CRBI           0.73870    0.25026   2.952 0.003468 ** 
## CWalks        -0.80172    0.31424  -2.551 0.011343 *  
## LeagueN       63.12305   78.94944   0.800 0.424756    
## DivisionW   -116.85917   40.28499  -2.901 0.004062 ** 
## PutOuts        0.28224    0.07721   3.655 0.000315 ***
## Assists        0.37319    0.21986   1.697 0.090902 .  
## Errors        -3.38913    4.37472  -0.775 0.439262    
## NewLeagueN   -25.31356   78.67426  -0.322 0.747916    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 314.9 on 244 degrees of freedom
## Multiple R-squared:  0.5461, Adjusted R-squared:  0.5126 
## F-statistic: 16.31 on 18 and 244 DF,  p-value: < 2.2e-16
lm3 = update(lm2, .~.-Years)
summary(lm3)
## 
## Call:
## lm(formula = Salary ~ AtBat + Hits + HmRun + Runs + RBI + Walks + 
##     CAtBat + CHits + CRuns + CRBI + CWalks + League + Division + 
##     PutOuts + Assists + Errors + NewLeague, data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -912.02 -180.92  -34.89  138.05 1881.51 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  148.43333   73.41097   2.022 0.044268 *  
## AtBat         -1.95091    0.62309  -3.131 0.001953 ** 
## Hits           7.39141    2.30241   3.210 0.001503 ** 
## HmRun          4.08280    5.51558   0.740 0.459869    
## Runs          -2.23967    2.80111  -0.800 0.424737    
## RBI           -0.99402    2.47109  -0.402 0.687845    
## Walks          6.19706    1.80517   3.433 0.000701 ***
## CAtBat        -0.19133    0.11657  -1.641 0.102010    
## CHits          0.20673    0.48050   0.430 0.667398    
## CRuns          1.42497    0.55716   2.558 0.011144 *  
## CRBI           0.74147    0.24958   2.971 0.003265 ** 
## CWalks        -0.80376    0.31356  -2.563 0.010966 *  
## LeagueN       64.19282   78.70619   0.816 0.415521    
## DivisionW   -116.06176   40.10620  -2.894 0.004148 ** 
## PutOuts        0.28303    0.07702   3.675 0.000292 ***
## Assists        0.37732    0.21894   1.723 0.086083 .  
## Errors        -3.31999    4.35935  -0.762 0.447044    
## NewLeagueN   -24.88922   78.51099  -0.317 0.751502    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 314.3 on 245 degrees of freedom
## Multiple R-squared:  0.546,  Adjusted R-squared:  0.5144 
## F-statistic: 17.33 on 17 and 245 DF,  p-value: < 2.2e-16
lm4 = update(lm3, .~.-NewLeague)
library(glmnet)
library(reshape2)


x_reg = model.matrix(Salary~., data=Hitters)[,-1]
y_reg = Hitters$Salary
grid = 10^seq(10, -2, length = 100)
ridge = glmnet(x_reg, y_reg, alpha = 0, lambda = grid)
coef_matrix_ridge = t(coef(ridge))
betas_ridge = as.data.frame(matrix(coef_matrix_ridge, ncol=20))
names(betas_ridge) = rownames(coef(ridge))
llambdas_ridge = log(ridge$lambda)
betas_ridge = cbind(betas_ridge, llambdas_ridge)
melted_ridge = melt(betas_ridge[,2:21], id.vars="llambdas_ridge")
ggplot(melted_ridge, aes(y=value, x = llambdas_ridge, colour=variable)) + geom_line()

lasso = glmnet(x_reg, y_reg, alpha = 1, lambda = grid)
coef_matrix_lasso = t(coef(lasso))
betas_lasso = as.data.frame(matrix(coef_matrix_lasso, ncol=20))
names(betas_lasso) = rownames(coef(lasso))
llambdas_lasso = log(lasso$lambda)
betas_lasso = cbind(betas_lasso, llambdas_lasso)
melted_lasso = melt(betas_lasso[,2:21], id.vars="llambdas_lasso")
ggplot(melted_lasso, aes(y=value, x = llambdas_lasso, colour=variable)) + geom_line()

library(lubridate)
marathon_ds = read.csv("marathon_results_2017.csv",header=TRUE)
names(marathon_ds)
##  [1] "X"             "Bib"           "Name"          "Age"          
##  [5] "M.F"           "City"          "State"         "Country"      
##  [9] "Citizen"       "X.1"           "X5K"           "X10K"         
## [13] "X15K"          "X20K"          "Half"          "X25K"         
## [17] "X30K"          "X35K"          "X40K"          "Pace"         
## [21] "Proj.Time"     "Official.Time" "Overall"       "Gender"       
## [25] "Division"
Official.Time = hms(marathon_ds$Official.Time)
Official.Time = hour(Official.Time)*60*60 + minute(Official.Time)*60 + second(Official.Time)

M.F = as.factor(marathon_ds$M.F)
Country = as.factor(marathon_ds$Country)

Age = marathon_ds$Age

X5K = hms(marathon_ds$X5K)
X5K = hour(X5K)*60*60 + minute(X5K)*60 + second(X5K)

X10K = hms(marathon_ds$X10K)
X10K = hour(X10K)*60*60 + minute(X10K)*60 + second(X10K)

Half = hms(marathon_ds$Half)
Half = hour(Half)*60*60 + minute(Half)*60 + second(Half)

X30K = hms(marathon_ds$X30K)
X30K = hour(X30K)*60*60 + minute(X30K)*60 + second(X30K)

ds_r = data.frame(Official.Time, M.F, Country, Age, X5K, X10K, Half, X30K)
ds_r = na.omit(ds_r)

lm1 = lm(Official.Time~., data=ds_r)
lm2  = update(lm1, .~.-Country)
lm3 = update(lm2, .~.-X10K)
x_m = model.matrix(Official.Time~.-Country, data=ds_r)[,-1]
y_m = ds_r$Official.Time
grid = 10^seq(10, -2, length = 100)
ridge_m = glmnet(x_m, y_m, alpha = 0, lambda = grid)
coef_matrix_ridge_m = t(coef(ridge_m))
betas_ridge_m = as.data.frame(matrix(coef_matrix_ridge_m, ncol=7))
names(betas_ridge_m) = rownames(coef(ridge_m))
llambdas_ridge_m = log(ridge_m$lambda)
betas_ridge_m = cbind(betas_ridge_m, llambdas_ridge_m)
melted_ridge_m = melt(betas_ridge_m[,2:8], id.vars="llambdas_ridge_m")
ggplot(melted_ridge_m, aes(y=value, x = llambdas_ridge_m, colour=variable)) + geom_line()

lasso_m = glmnet(x_m, y_m, alpha = 1, lambda = grid)
coef_matrix_lasso_m = t(coef(lasso_m))
betas_lasso_m = as.data.frame(matrix(coef_matrix_lasso_m, ncol=7))
names(betas_lasso_m) = rownames(coef(lasso_m))
llambdas_lasso_m = log(lasso_m$lambda)
betas_lasso_m = cbind(betas_lasso_m, llambdas_lasso_m)
melted_lasso_m = melt(betas_lasso_m[,2:8], id.vars="llambdas_lasso_m")
ggplot(melted_lasso_m, aes(y=value, x = llambdas_lasso_m, colour=variable)) + geom_line()


Module 7: Moving Beyond Linearity


Module 8: Tree-Based Methods

Copy from Thea’s module


Module 9: Support Vector Machines


Module 10: Unsupervised Learning


Module 11: Neural Networks


Common - for all modules

WHAT to emphasize here?

Practical details - go to Blackboard

Teaching philosophy in this course

The modules:

  • divide into modular units with specific focus
  • smaller units (time and topic) facilitates learning?

The lectures:

  • We have 2*2hrs of lectures every week (except when working with the compulsory exercises).
  • Each week is a new topic and a new module (maybe a few exceptions)
  • Each new module always starts with a motivating example.
  • We focus on:
    • statistical theory
    • application of theory to real (and simulated) data using the R language and the Rstudio IDE
    • interpretation of results
    • assumptions and limitations of the models and methods
    • extra material wrt theory is either given on the module page or linked to from other sources
    • possible extensions and further reading (beyond the scope of this course) are given in the end of the module page
  • Each week the second lecture always ends with a Kahoot! quiz (w/wo bots?)

The weekly supervision sessions: For each module recommended exercises are given. These are partly

  • theoretical exercises (from book or not)
  • computational tasks
  • data analysis

These are supervised in the weekly exercise slots.

The compulsory exercises:

  • There are three compulsory exercises, each gives a maximal score of 10 points.
  • These are supervised in the weekly exercise slots (and also in the lecture slots the week of the hand-in deadline).
  • Focus is both on theory and analysis and interpretation in R.
  • Students can work in groups (max size 3) and work must be handed in on Blackboard and be written in R Markdown (both .Rmd and .pdf handed in).
  • The TAs grade the exercises (0-10 points).
  • This gives 30% of the final evaluation in the course, and the written exam the final 70%.

R, Rstudio, CRAN and GitHub - and R Markdown

What is R? https://www.r-project.org/about.html

What is Rstudio? https://www.rstudio.com/products/rstudio/

What is an R package? http://r-pkgs.had.co.nz (We will make an R package in the exercise part of this course.)

What is CRAN? https://cran.uib.no/

What is GitHub and Bitbucket? Do we need GitHub or Bitbucket in our course? https://www.youtube.com/watch?v=w3jLJU7DT5E and https://techcrunch.com/2012/07/14/what-exactly-is-github-anyway/

What is R Markdown? http://r4ds.had.co.nz/r-markdown.html

What is knitr? https://yihui.name/knitr/

A first look at R and R studio

LINK TO BEGINNER STUFF: Rbeginner

For the advanced R user

Beautiful graphics with ggplot

LINK TO GGPLOT STUFF: Rggplot

Explore R Markdown in Rstudio

LINK TO RMARKDOWN STUFF: Rrmarkdown

Repetition of TMA4240/45 Statistics using R

LINK TO THIS: RexamTMA4240H2016, RexamTMA4245V2017?

Further reading