Solutions to Recommended Exercises in Module 4: CLASSIFICATION

TMA4268 Statistical Learning V2018

Julia Debik and Martina Hall, Department of Mathematical Sciences, NTNU

Week 5, 2018 (Version 02.02.2018)


Theoretical exercises

Bank notes and LDA

Here we have measurements of the length and the diagonal of an image part of \(n_G = 500\) genuine bank notes and \(n_F = 500\) false bank notes, with the following mean vectors and covariance matrices, \[ \hat{\boldsymbol\mu}_G = \bar{\bf x}_G=\left[ \begin{array}{c} 214.97 \\ 141.52 \end{array} \right] \text{ and } \hat{\boldsymbol \Sigma}_G=\left[ \begin{array}{cc} 0.1502 & 0.0055 \\ 0.0055 & 0.1998 \end{array} \right] \] \[ \hat{\boldsymbol\mu}_F = \bar{\bf x}_F= \left[ \begin{array}{c} 214.82 \\ 139.45 \end{array} \right] \text{ and } \hat{\boldsymbol \Sigma}_F= \left[ \begin{array}{cc} 0.1240 & 0.0116 \\ 0.0116 & 0.3112 \end{array} \right] \] a. Assuming the observations \(x_G\) and \(x_F\) are independent observations from normal distributions with a common covariance matrix \(\boldsymbol \Sigma = \boldsymbol \Sigma_G = \boldsymbol \Sigma_F\), and that all observations are independent of each other, we can find the estimated pooled covariance matrix as \[\begin{align*} \hat{\boldsymbol \Sigma} &= \frac{(n_G - 1)\hat{\boldsymbol \Sigma}_G + (n_F - 1)\hat{\boldsymbol \Sigma}_F}{n_G + n_F - 2} \\ &= \left[ \begin{array}{cc} 0.13710 & 0.00855 \\ 0.00855 & 0.25550 \end{array} \right] \end{align*}\]
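As a quick sanity check, the pooled estimate can be reproduced in R from the summary statistics above; a minimal sketch:

nG = 500; nF = 500
SigmaG = matrix(c(0.1502, 0.0055, 0.0055, 0.1998), nrow = 2)
SigmaF = matrix(c(0.1240, 0.0116, 0.0116, 0.3112), nrow = 2)
Sigma.pooled = ((nG - 1) * SigmaG + (nF - 1) * SigmaF) / (nG + nF - 2)
Sigma.pooled  # matches the matrix above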

b.

For LDA we assume that the class-conditional distributions are normal (Gaussian) and that all of the classes have the same covariance matrix. Assuming the same covariance matrix for both classes (G and F), we classify the new observation \({\bf x_0}\) based on which of the discriminant functions is largest, \[\delta_k({\bf x}) = {\bf x}^T \boldsymbol{\Sigma}^{-1}\boldsymbol\mu_k - \frac{1}{2}\boldsymbol\mu_k^T \boldsymbol{\Sigma}^{-1}\boldsymbol\mu_k + \log \pi_k.\]

We have not been given any information about the prior probabilities - but we would expect the probability of a fake bank note to be much smaller than the probability of a genuine bank note.

However, since our training data consist of 50% fake and 50% genuine notes, we might use this to estimate the prior probabilities, \(\hat{\pi}_G =\hat{\pi}_F =\frac{n_F}{n} = 0.5.\) Inserting the pooled covariance matrix and the estimated mean values, we have that \[\delta_G({\bf x_0}) = {\bf x_0}^T \hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol\mu}_G - \frac{1}{2}\hat{\boldsymbol\mu}_G^T \hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol\mu}_G + \log \hat{\pi}_G\] and \[\delta_F({\bf x_0}) = {\bf x_0}^T \hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol\mu}_F - \frac{1}{2}\hat{\boldsymbol\mu}_F^T \hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol\mu}_F + \log \hat{\pi}_F.\]

Alternatively: The rule would be to classify to \(G\) if \(\delta_G({\bf x_0})-\delta_F({\bf x_0})>0\), which can be written \[{\bf x_0}^T \hat{\boldsymbol{\Sigma}}^{-1}(\hat{\boldsymbol\mu}_G -\hat{\boldsymbol\mu}_F)- \frac{1}{2}\hat{\boldsymbol\mu}_G^T \hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol\mu}_G +\frac{1}{2}\hat{\boldsymbol\mu}_F^T \hat{\boldsymbol{\Sigma}}^{-1}\hat{\boldsymbol\mu}_F+ (\log \hat{\pi}_G -\log \hat{\pi}_F)>0\]

c.

With \({\bf x_0} = [214.0, 140.4]^T\) and using the formula given in the exercise, \(\hat{\boldsymbol{\Sigma}}^{-1} = \left[ \begin{array}{cc} 7.31 & -0.24 \\ -0.24 & 3.92 \end{array} \right]\). Inserting these into the formulas, we have that \[\delta_G({\bf x_0}) = \left[ 214 \text{ } 140.4 \right] \left[ \begin{array}{cc} 7.31 & -0.24 \\ -0.24 & 3.92 \end{array} \right] \left[ \begin{array}{c} 214.97 \\ 141.52 \end{array} \right] - \frac{1}{2} \left[ 214.97 \text{ } 141.52 \right] \left[ \begin{array}{cc} 7.31 & -0.24 \\ -0.24 & 3.92 \end{array} \right] \left[ \begin{array}{c} 214.97 \\ 141.52 \end{array} \right] - \log 2 = 198667.1\]

and \[\delta_F({\bf x_0}) = \left[ 214 \text{ } 140.4 \right] \left[ \begin{array}{cc} 7.31 & -0.24 \\ -0.24 & 3.92 \end{array} \right] \left[ \begin{array}{c} 214.82 \\ 139.45 \end{array} \right] - \frac{1}{2} \left[ 214.82 \text{ } 139.45 \right] \left[ \begin{array}{cc} 7.31 & -0.24 \\ -0.24 & 3.92 \end{array} \right] \left[ \begin{array}{c} 214.82 \\ 139.45 \end{array} \right] - \log 2 = 198668.3\] Since \(\delta_F({\bf x_0})\) is larger than \(\delta_G({\bf x_0})\), we classify the bank note as fake!
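The same classification can be carried out numerically in R, reusing Sigma.pooled from the sketch in a. (small discrepancies from the hand calculation arise because \(\hat{\boldsymbol{\Sigma}}^{-1}\) was rounded to two decimals above):

x0 = c(214.0, 140.4)
muG = c(214.97, 141.52)
muF = c(214.82, 139.45)
Sigma.inv = solve(Sigma.pooled)
deltaG = c(x0 %*% Sigma.inv %*% muG - 0.5 * muG %*% Sigma.inv %*% muG) + log(0.5)
deltaF = c(x0 %*% Sigma.inv %*% muF - 0.5 * muF %*% Sigma.inv %*% muF) + log(0.5)
deltaF > deltaG  # TRUE, so the note is classified as fake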

Exercise 4.7.9

a. Recall that the odds are defined as \[\text{odds} = \frac{p(x)}{1-p(x)},\] where \(p(x) = \text{Pr}(Y=\text{default} | X=x)\). We know that the odds equal 0.37. We thus solve for \(p(x)\): \[\begin{align*} \frac{p(x)}{1-p(x)} &= 0.37 \\ p(x) &= 0.37 (1-p(x)) \\ 1.37 \, p(x) &= 0.37 \\ p(x) &= \frac{0.37}{1.37} \approx 0.270 \end{align*} \] b. For an individual we are given that \(\text{Pr}(Y= \text{default}| X=x) = 0.16\). We are asked to find the odds that she will default. We can calculate this by inserting the probability of default into the formula for the odds: \[\frac{p(x)}{1-p(x)} = \frac{0.16}{1-0.16} \approx 0.190\]
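Both conversions are one-liners in R:

odds = 0.37
odds / (1 + odds)  # probability of default, approx. 0.270
p = 0.16
p / (1 - p)        # odds of default, approx. 0.190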

Exercise 4.7.6

a. We have that \(X_1\) = hours studied = 40 and \(X_2\) = undergrad GPA = 3.5. The formula for the predicted probability is \[\begin{align*}p(Y = A | X_1 = 40, X_2 = 3.5) &= \frac{\exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2)}{1 + \exp(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2)} \\ &= \frac{\exp(-6 + 0.05 \cdot 40 + 1 \cdot 3.5)}{1 + \exp(-6 + 0.05 \cdot 40 + 1 \cdot 3.5)}\\ &\approx 0.378 \end{align*}\]
b. We know that \(x_2\) = 3.5, and need to solve \(\hat{p}(Y = A | x_1, x_2 = 3.5) = 0.5\) for \(x_1\). \[\begin{align*} 0.5 &= \frac{\exp(-6+0.05 x_1 + 3.5)}{1+\exp(-6+0.05 x_1 + 3.5)} \\ 0.5 (1+\exp(-2.5+0.05x_1)) &=\exp(-2.5+0.05x_1) \\ \exp(-2.5+0.05x_1) &= 1 \\ -2.5+0.05x_1 &= \log(1) = 0 \\ x_1 &= 50 \text{ hours} \end{align*}\]
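Both answers are easy to verify numerically; a small sketch using the given coefficient estimates:

beta = c(-6, 0.05, 1)                    # (beta0, beta1, beta2)
eta = sum(beta * c(1, 40, 3.5))          # linear predictor for x1 = 40, x2 = 3.5
exp(eta) / (1 + exp(eta))                # approx. 0.378
# a 50% chance requires beta0 + beta1*x1 + beta2*3.5 = 0; solve for x1:
(0 - beta[1] - beta[3] * 3.5) / beta[2]  # 50 hours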

Sensitivity, specificity, ROC and AUC

a.

We start by denoting the number of true diseased as \(P\) and the number of true non-diseased as \(N\), and we have that \(n = P+N\). We count the ones with predicted probability of disease \(p(x)>0.5\), and denote the count \(P^*\) and the count for the ones with \(p(x)\leq0.5\) as \(N^*\). Then, we can make the confusion table

                            Predicted non-diseased \(-\)   Predicted diseased \(+\)   Total
True non-diseased \(-\)     TN                             FP                         N
True diseased \(+\)         FN                             TP                         P
Total                       N\(^*\)                        P\(^*\)                    n

where TN (true negative) is the number of predicted non-diseased that are actually non-diseased, FP (false positive) is the number of predicted diseased that are actually non-diseased, FN (false negative) is the number of predicted non-diseased that are actually diseased and TP (true positive) is the number of predicted diseased that are actually diseased. Using the confusion table, we can calculate the sensitivity and specificity, where \[ \text{Sens} = \frac{\text{True positive}}{\text{Actual positive}} = \frac{TP}{P} \] and \[ \text{Spec} = \frac{\text{True negative}}{\text{Actual negative}} = \frac{TN}{N} \]
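As a small sketch, assuming we have a vector disease of true 0/1 classes and a vector prob of predicted probabilities (hypothetical names, not given in the exercise), both quantities can be computed from the confusion table:

pred = ifelse(prob > 0.5, 1, 0)
conf = table(predicted = pred, true = disease)
sens = conf["1", "1"] / sum(conf[, "1"])  # TP / P
spec = conf["0", "0"] / sum(conf[, "0"])  # TN / N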

b.

In the ROC curve we plot the sensitivity against 1-specificity for all possible thresholds of the probability. To construct the ROC curve we would have to calculate the sensitivity and specificity for different values of the cutoff in the rule \(p(x)>\text{cut}\). With a threshold of 0.5, a new person with a probability of 0.51 of having the disease is classified as diseased, while another person with a probability of 0.49 is classified as non-diseased, even though their estimated risks are nearly identical. Because any single cutoff is thus somewhat arbitrary, the ROC curve and the area under the ROC curve are useful tools, as they consider all possible thresholds for the cutoff.
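Continuing the sketch above (still assuming the hypothetical disease and prob vectors), the ROC curve can be constructed by hand by sweeping the cutoff:

cuts = seq(0, 1, length.out = 100)
sens = sapply(cuts, function(cut) sum(prob > cut & disease == 1) / sum(disease == 1))
spec = sapply(cuts, function(cut) sum(prob <= cut & disease == 0) / sum(disease == 0))
plot(1 - spec, sens, type = "l", xlab = "1 - specificity", ylab = "sensitivity")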

c.

The AUC is the area under the ROC curve and gives the overall performance of the test for all possible thresholds. An AUC value of 1 means a perfect fit for all possible thresholds, while an AUC of 0.5 corresponds to a classifier that performs no better than chance. Hence, if a classification method \(p(x)\) gives an AUC of 0.6 and another method \(q(x)\) gives 0.7, we would prefer \(q(x)\), as it has the higher AUC value.
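The AUC can then be approximated from the hand-made ROC points with the trapezoidal rule; a sketch reusing sens and spec from above:

fpr = 1 - spec
ord = order(fpr)
# sum of segment widths times average segment heights
sum(diff(fpr[ord]) * (head(sens[ord], -1) + tail(sens[ord], -1)) / 2)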

Data analysis in R

Exercise 4.7.10

a. Install the ISLR package and load the ggplot2 and GGally libraries.

# install.packages("ISLR")
library(ISLR)
library(ggplot2)
library(GGally)

Now make a summary of the Weekly data set using the summary function and pairwise plots of the variables using the ggpairs function:

attach(Weekly)
summary(Weekly)
##       Year           Lag1               Lag2               Lag3         
##  Min.   :1990   Min.   :-18.1950   Min.   :-18.1950   Min.   :-18.1950  
##  1st Qu.:1995   1st Qu.: -1.1540   1st Qu.: -1.1540   1st Qu.: -1.1580  
##  Median :2000   Median :  0.2410   Median :  0.2410   Median :  0.2410  
##  Mean   :2000   Mean   :  0.1506   Mean   :  0.1511   Mean   :  0.1472  
##  3rd Qu.:2005   3rd Qu.:  1.4050   3rd Qu.:  1.4090   3rd Qu.:  1.4090  
##  Max.   :2010   Max.   : 12.0260   Max.   : 12.0260   Max.   : 12.0260  
##       Lag4               Lag5              Volume       
##  Min.   :-18.1950   Min.   :-18.1950   Min.   :0.08747  
##  1st Qu.: -1.1580   1st Qu.: -1.1660   1st Qu.:0.33202  
##  Median :  0.2380   Median :  0.2340   Median :1.00268  
##  Mean   :  0.1458   Mean   :  0.1399   Mean   :1.57462  
##  3rd Qu.:  1.4090   3rd Qu.:  1.4050   3rd Qu.:2.05373  
##  Max.   : 12.0260   Max.   : 12.0260   Max.   :9.32821  
##      Today          Direction 
##  Min.   :-18.1950   Down:484  
##  1st Qu.: -1.1540   Up  :605  
##  Median :  0.2410             
##  Mean   :  0.1499             
##  3rd Qu.:  1.4050             
##  Max.   : 12.0260
ggpairs(Weekly, ggplot2::aes(color=Direction), lower = list(continuous = wrap("points", alpha = 0.3, size=0.2)))

We can observe that the variables Year and Volume are highly correlated, and it looks like Volume increases quadratically with increasing Year. No other clear patterns are observed.

b. We fit a logistic regression model using the glm function, specifying a logistic model by giving family="binomial" as an argument.

glm.Weekly = glm(Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume, family="binomial")
summary(glm.Weekly)
## 
## Call:
## glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + 
##     Volume, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6949  -1.2565   0.9913   1.0849   1.4579  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  0.26686    0.08593   3.106   0.0019 **
## Lag1        -0.04127    0.02641  -1.563   0.1181   
## Lag2         0.05844    0.02686   2.175   0.0296 * 
## Lag3        -0.01606    0.02666  -0.602   0.5469   
## Lag4        -0.02779    0.02646  -1.050   0.2937   
## Lag5        -0.01447    0.02638  -0.549   0.5833   
## Volume      -0.02274    0.03690  -0.616   0.5377   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1496.2  on 1088  degrees of freedom
## Residual deviance: 1486.4  on 1082  degrees of freedom
## AIC: 1500.4
## 
## Number of Fisher Scoring iterations: 4

Only Lag2 appears to be a significant predictor.

c. We use our fitted model to calculate the probability of Direction="Up" for each observation, and compare the predictions with the true classes of the response variable. To find the confusion matrix, the function table can be used.

glm.probs_Weekly = predict(glm.Weekly, type="response")
glm.preds_Weekly = ifelse(glm.probs_Weekly > 0.5, "Up", "Down")
table(glm.preds_Weekly, Direction)
##                 Direction
## glm.preds_Weekly Down  Up
##             Down   54  48
##             Up    430 557

We see that this classifier makes poor predictions. The fraction of correct predictions is \[\frac{54+557}{1089} = 0.561, \] which is barely better than the \(605/1089 \approx 0.556\) obtained by always predicting Up. From the confusion matrix we also see that the classifier does a good job predicting when the market goes Up, but a poor job predicting when the market goes Down.
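The fraction of correct predictions can also be computed directly:

mean(glm.preds_Weekly == Direction)  # approx. 0.561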

d. We start by dividing the Weekly data set into a train and a test set, where the training set consists of all observations in the period from 1990 to 2008, while the test set consists of the observations from the period 2009 to 2010. We then fit a logistic regression model to the training data set, make predictions for the test set, and calculate the confusion matrix.

Weekly_trainID = (Year < 2009)
Weekly_train = Weekly[Weekly_trainID,]
Weekly_test = Weekly[!Weekly_trainID,]

glm.Weekly2 = glm(Direction~Lag2, family="binomial", data=Weekly_train)
glm.Weekly2_prob = predict(glm.Weekly2, newdata=Weekly_test, type="response")
glm.Weekly2_pred = ifelse(glm.Weekly2_prob > 0.5, "Up", "Down")
table(glm.Weekly2_pred, Weekly_test$Direction)
##                 
## glm.Weekly2_pred Down Up
##             Down    9  5
##             Up     34 56

The fraction of correct predictions on the test set is \[\frac{9+56}{104} = 0.625.\]

e. The lda function is available in the MASS library, so we need to start by loading the library. We proceed by fitting an LDA model to the training set. We then use this model to make predictions for the test set and compare the predicted and true classes using the confusion matrix.

library(MASS)
lda.Weekly = lda(Direction~Lag2, data=Weekly_train)
lda.Weekly_pred = predict(lda.Weekly, newdata=Weekly_test)$class
table(lda.Weekly_pred, Weekly_test$Direction)
##                
## lda.Weekly_pred Down Up
##            Down    9  5
##            Up     34 56

The fraction of correct classifications is \[\frac{9+56}{104} = 0.625.\]

f. We follow the same procedure and test a QDA classifier on the data:

qda.Weekly = qda(Direction~Lag2, data=Weekly_train)
qda.Weekly_pred = predict(qda.Weekly, newdata=Weekly_test)$class
table(qda.Weekly_pred, Weekly_test$Direction)
##                
## qda.Weekly_pred Down Up
##            Down    0  0
##            Up     43 61

The fraction of correct classifications is now \[\frac{0+61}{104} = 0.587.\]

g. The KNN classifier is implemented in the knn function of the class library. This function requires some preparation of the data, as it does not accept a formula as a function argument.

library(class)
knn.train = as.matrix(Weekly_train$Lag2)
knn.test = as.matrix(Weekly_test$Lag2)

knn1.Weekly = knn(train = knn.train, test = knn.test, cl = Weekly_train$Direction, k=1)
table(knn1.Weekly, Weekly_test$Direction)
##            
## knn1.Weekly Down Up
##        Down   21 30
##        Up     22 31

The fraction of correct classifications for the KNN classifier is \[\frac{21+31}{104}=0.500\]

h. The logistic regression model and the LDA classifier provided the highest fractions of correct classifications on this data.

i. No solution provided here.

Exercise 4.7.11

a. Let’s look at the data

attach(Auto)
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500

We create the binary variable mpg01 using the median and the ifelse function. We continue by making a data frame, where the original mpg variable is replaced by the mpg01 variable, and where all other covariates, except the name, are included.

mpg.median = median(Auto$mpg)
mpg01 = ifelse(Auto$mpg > mpg.median, 1, 0)
auto = data.frame(mpg01 = mpg01, Auto[,2:8])
head(auto)
##   mpg01 cylinders displacement horsepower weight acceleration year origin
## 1     0         8          307        130   3504         12.0   70      1
## 2     0         8          350        165   3693         11.5   70      1
## 3     0         8          318        150   3436         11.0   70      1
## 4     0         8          304        150   3433         12.0   70      1
## 5     0         8          302        140   3449         10.5   70      1
## 6     0         8          429        198   4341         10.0   70      1

b. We can explore the correlations in the data set graphically using the corrplot library. The size of the circle indicates the absolute value of the correlation, while the color indicates whether the correlation is positive (blue) or negative (red).

# install.packages("corrplot")
library(corrplot)
auto.cor = cor(auto)
corrplot(auto.cor,  tl.col="black")

The response variable mpg01 has the highest correlation with the variables cylinders, displacement, horsepower and weight. All of these correlations are negative. We also see that the covariates are highly correlated with each other.

Pairwise scatter plots of the variables can be made using the ggpairs function from the GGally library.

auto$mpg01 = as.factor(auto$mpg01)
ggpairs(auto, ggplot2::aes(color=mpg01), lower = list(continuous = wrap("points", alpha = 0.3, size=0.2)), upper="blank")

The variables cylinders, displacement, horsepower and weight have a high influence on the response value. There is no visible pattern between the variables acceleration and year and the response variable.

Boxplots of the variables

# install.packages("reshape2")
require(reshape2)
auto.melt= melt(auto, id.var="mpg01" )
ggplot(data = auto.melt, aes(x=variable, y=value)) + geom_boxplot(aes(fill=mpg01)) + facet_wrap( ~ variable, scales="free")

c. We split the data randomly into a training set and a test set of equal size by using the function sample. We set a seed so that the results are reproducible.

set.seed(100)
n = dim(auto)[1]
auto.trainID = sample(1:n, size = n/2)
auto.train = auto[auto.trainID, ]
auto.test = auto[-auto.trainID, ]

d.

auto.lda = lda(mpg01~cylinders+displacement+horsepower+weight, data=auto.train)
auto.lda.pred = predict(auto.lda, newdata=auto.test)$class
# Test error
mean(auto.lda.pred != auto.test$mpg01)
## [1] 0.1173469

e.

auto.qda = qda(mpg01~cylinders+displacement+horsepower+weight, data=auto.train)
auto.qda.pred = predict(auto.qda, newdata=auto.test)$class
# Test error
mean(auto.qda.pred != auto.test$mpg01)
## [1] 0.1173469

f.

auto.glm = glm(mpg01~cylinders+displacement+horsepower+weight, data=auto.train, family="binomial")
auto.glm.prob = predict(auto.glm, newdata=auto.test, type="response")
auto.glm.pred = ifelse(auto.glm.prob > 0.5, 1, 0)
# Test error
mean(auto.glm.pred != auto.test$mpg01)
## [1] 0.1020408

g. We test the performance of the KNN classifier by trying all values of \(K\) from 1 to 50.

auto.knn.train = as.matrix(auto.train[,2:8])
auto.knn.test = as.matrix(auto.test[,2:8])

K = 50
auto.knn.error = rep(NA, K)

for(k in 1:K){
  auto.knn.pred = knn(train = auto.knn.train, test = auto.knn.test, cl=auto.train$mpg01, k = k)
  auto.knn.error[k] = mean(auto.knn.pred != auto.test$mpg01)
}

knn.error.df = data.frame(k=1:K, error = auto.knn.error)
ggplot(knn.error.df, aes(x=k, y=error))+geom_point(col="blue")+geom_line(linetype="dotted")

knn.min = min(auto.knn.error)
# Test error
knn.min
## [1] 0.1173469
which(auto.knn.error == knn.min)
## [1]  5 45

Sensitivity, specificity, ROC and AUC

a.

#install.packages("DAAG")
library(DAAG)
attach(frogs)

b.

glm.fit = glm(pres.abs~distance + NoOfPools + meanmin, data=frogs, family="binomial")
summary(glm.fit)
## 
## Call:
## glm(formula = pres.abs ~ distance + NoOfPools + meanmin, family = "binomial", 
##     data = frogs)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8263  -0.8011  -0.4572   0.9100   2.8734  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.6332534  1.1464897  -4.041 5.32e-05 ***
## distance    -0.0006007  0.0001731  -3.471 0.000518 ***
## NoOfPools    0.0251223  0.0080723   3.112 0.001857 ** 
## meanmin      1.3438182  0.3087004   4.353 1.34e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 279.99  on 211  degrees of freedom
## Residual deviance: 216.10  on 208  degrees of freedom
## AIC: 224.1
## 
## Number of Fisher Scoring iterations: 6

c.

glm.fit.pred = ifelse(fitted.values(glm.fit)>0.5, 1, 0)
table(glm.fit.pred, frogs$pres.abs)
##             
## glm.fit.pred   0   1
##            0 114  29
##            1  19  50
The training error can be calculated as the overall misclassification rate, \[\text{Error}_\text{train} = \frac{29+19}{212} \approx 0.23 \] The training error among the present (true class 1) is \[\text{Error}_\text{present} = \frac{29}{29+50} \approx 0.37 \] and the training error among the absent (true class 0) is \[\text{Error}_\text{absent} = \frac{19}{114+19} \approx 0.14 \]
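These error rates can be computed directly from the confusion table; a small sketch:

conf = table(glm.fit.pred, frogs$pres.abs)
1 - sum(diag(conf)) / sum(conf)    # overall training error
conf["0", "1"] / sum(conf[, "1"])  # error among the present
conf["1", "0"] / sum(conf[, "0"])  # error among the absent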

library(pROC)
full.roc = roc(frogs$pres.abs, fitted.values(glm.fit))
auc(full.roc)
## Area under the curve: 0.8188
ggroc(full.roc)

d.

library(MASS) # lda() is in the MASS library (loaded earlier), not class
lfit = lda(pres.abs~distance + NoOfPools + meanmin, data=frogs)
lpredClass=predict(object=lfit)$class
table(lpredClass, frogs$pres.abs)
##           
## lpredClass   0   1
##          0 111  41
##          1  22  38
The training error is again the overall misclassification rate, \[\text{Error}_\text{train} = \frac{41+22}{212} \approx 0.30 \] The training error among the present is \[\text{Error}_\text{present} = \frac{41}{41+38} \approx 0.52 \] and the training error among the absent is \[\text{Error}_\text{absent} = \frac{22}{111+22} \approx 0.17 \]

library(pROC)
lpred=predict(object=lfit)$posterior[,1]
lres=roc(response=frogs$pres.abs,predictor=lpred)
auc(lres)
## Area under the curve: 0.7969
ggroc(lres)

e. Evaluating the model on the same data that were used for training will in general give overly optimistic results compared to using a separate data set for testing: the model is fitted so that it predicts these particular data as well as possible! Hence, the ROC curve and AUC value we get here are close to the best possible for this model, and it would be more realistic to evaluate the ROC curve and AUC value on a test set.