(Last changes: 07.05)
You need to install the following packages in R to run the code in this file.
install.packages("knitr") #probably already installed
install.packages("rmarkdown") #probably already installed
install.packages("bestglm")# for subset selection with categorical variables
install.packages("glmnet")# for lasso
install.packages("tree") #tree
install.packages("randomForest") #for random forest
install.packages("ElemStatLearn") #dataset in Problem 2
BiocManager::install(c("pheatmap")) #heatmap in Problem 2
The following are packages that might be useful.
install.packages("ggplot2")
install.packages("GGally") # for ggpairs
install.packages("caret") #for confusion matrices
install.packages("pROC") #for ROC curves
install.packages("e1071") # for support vector machines
install.packages("nnet") # for feed forward neural networks
In this problem we will work with understanding and applying different regression methods, and will use the diamonds data set from the ggplot2 package. The diamonds data set consists of price (in US dollars), which will be our response, and quality information (9 covariates) for around 54000 diamonds. According to Wickham (2016) (Section 3.10) there are four C's of diamond quality: carat (weight), cut (quality of the cut: Fair/Good/Very Good/Premium/Ideal), colour (from worst J to best D) and clarity (from worst to best: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF). In addition there are five physical measurements: depth (total depth percentage, calculated from x, y and z), table (width of the top of the diamond relative to the widest point), xx (length in mm), yy (width in mm) and zz (depth in mm), explained with a drawing in Figure 3.1 of the Wickham (2016) book (ebook available for free for NTNU students). There are no missing data. In addition we have made two transformed variables: logprice, the logarithm (base 10) of the price, and logcarat, the logarithm (base 10) of the carat. To be able to do the analyses without too much hassle (running time for analyses and plotting) we constructed a smaller version of the data set, and divided it into a training set dtrain and a test set dtest, to be loaded from the course web-page as follows.
all=dget("https://www.math.ntnu.no/emner/TMA4268/2019v/data/diamond.dd")
dtrain=all$dtrain
dtest=all$dtest
Our aim will be to predict the price of the diamond, based on some or all of the covariates. We would also like to understand which of the covariates are important in predicting the price and how the covariates are related to the price.
Q1: Would you choose price or logprice as the response variable? Justify your choice. Next, plot your choice of response pairwise with carat, logcarat, color, clarity and cut. Comment.
Answer
par(mfrow=c(1,2))
hist(dtrain$price)
hist(dtrain$logprice)
plot(dtrain$carat,dtrain$logprice)
plot(dtrain$logcarat,dtrain$logprice)
par(mfrow=c(3,1), mar = c(4,3,0,0), mgp=c(2, 1, 0))
plot(logprice~color,data=dtrain)
plot(logprice~clarity,data=dtrain)
plot(logprice~cut,data=dtrain)
par(mfrow=c(1,1))
We choose logprice as the response. The variable price is positive and right-skewed; the log-transform is not skewed and is likely to normalize the variance across observations. It is also convenient that predictions for price are always positive when using the log-transform. Using logprice as the response also means that we have a multiplicative model in the covariates. The relationship between logprice and logcarat is approximately linear. logprice is increasing with the letter of color, while we would expect the opposite. The clarity levels SI2, VVS1 and IF have many outliers in logprice, which makes interpretation hard; still, clarity seems like an important variable to include because of the large differences in logprice. cut might be less important, and it is surprising to see the smallest average for Ideal.
We now turn to the relationship between logprice and carat.
Use the local regression model \(\texttt{logprice} = \beta_0 + \beta_1 \texttt{carat}+ \beta_2 \texttt{carat}^2\) weighted by the tricube kernel \(K_{i0}\).
Q2: What is the predicted price of a diamond weighing 1 carat? Use the closest 20% of the observations.
Q3: What choice of \(\beta_1\), \(\beta_2\) and \(K_{i0}\) would result in KNN-regression?
Answer
plot(dtrain$carat,dtrain$logprice)
fit=loess(dtrain$logprice~dtrain$carat,span=0.2,degree=2)
pred=predict(fit)
lines(dtrain$carat[order(dtrain$carat)],pred[order(dtrain$carat)],col=2,lwd=2)
10^predict(loess(logprice~carat, span=.2, data=dtrain), 1)
## [1] 5100.674
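As a check on the KNN interpretation discussed in the answer to Q3 below, here is a minimal sketch (using the dtrain object loaded above) of the corresponding KNN-style estimate: the plain mean of logprice over the 20% of training observations with carat closest to 1, back-transformed to USD.
k = ceiling(0.2*nrow(dtrain)) # number of neighbours: 20% of the training observations
idx = order(abs(dtrain$carat-1))[1:k] # the k observations with carat closest to 1
10^mean(dtrain$logprice[idx]) # constant-kernel (KNN) estimate of the price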
The predicted price of a diamond weighing 1 carat is about 5101 USD, found by back-transforming (10^) the loess prediction of logprice at carat equal to 1. KNN regression is obtained by setting \(\beta_1=\beta_2=0\) and letting \(K_{i0}\) put a constant weight on the 20% of the observations with carat closest to 1, and zero weight on the rest. The \(\beta_0\) is then estimated by minimizing \(\sum_{i=1}^n K_{i0}(y_i-\beta_0)^2\), which will give \(\hat{\beta}_0\) to be the mean of the response of the 20% closest observations.
Q4: Describe how you can perform model selection in regression with AIC as criterion.
Q5: What are the main differences between using AIC for model selection and using cross-validation (with mean squared test error MSE)?
Q6: See the code below for performing model selection with bestglm() based on AIC. What kind of contrast is used to represent cut, color and clarity? Write down the final best model and explain what you can interpret from the model.
Q7: Calculate and report the MSE of the test set using the best model (on the scale of logprice).
library(bestglm)
ds=as.data.frame(within(dtrain,{
y=logprice # setting the response
logprice=NULL # do not include as covariate
price=NULL # do not include as covariate
carat=NULL # do not include as covariate
}))
fit=bestglm(Xy=ds, IC="AIC")$BestModel
summary(fit)
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31283 -0.03717 0.00033 0.03564 0.58991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.985969 0.042993 69.452 < 2e-16 ***
## logcarat 1.580675 0.029042 54.428 < 2e-16 ***
## cutGood 0.028558 0.005747 4.970 6.93e-07 ***
## cutVery Good 0.040480 0.005285 7.660 2.22e-14 ***
## cutPremium 0.042984 0.005295 8.118 5.90e-16 ***
## cutIdeal 0.054829 0.005223 10.498 < 2e-16 ***
## colorE -0.021787 0.003031 -7.189 7.51e-13 ***
## colorF -0.038892 0.003107 -12.519 < 2e-16 ***
## colorG -0.067999 0.003024 -22.488 < 2e-16 ***
## colorH -0.112815 0.003235 -34.877 < 2e-16 ***
## colorI -0.165072 0.003686 -44.783 < 2e-16 ***
## colorJ -0.223651 0.004543 -49.233 < 2e-16 ***
## claritySI2 0.174988 0.007933 22.060 < 2e-16 ***
## claritySI1 0.251631 0.007876 31.950 < 2e-16 ***
## clarityVS2 0.315556 0.007924 39.824 < 2e-16 ***
## clarityVS1 0.346409 0.008051 43.027 < 2e-16 ***
## clarityVVS2 0.402528 0.008233 48.894 < 2e-16 ***
## clarityVVS1 0.433326 0.008549 50.684 < 2e-16 ***
## clarityIF 0.473681 0.009154 51.745 < 2e-16 ***
## xx 0.069331 0.006610 10.489 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05978 on 4980 degrees of freedom
## Multiple R-squared: 0.9815, Adjusted R-squared: 0.9814
## F-statistic: 1.39e+04 on 19 and 4980 DF, p-value: < 2.2e-16
Answer
attr(model.matrix(fit),"contrasts")
pred=predict(fit,newdata=dtest)
testerror=mean((pred-dtest$logprice)^2)
testerror
## $cut
## [1] "contr.treatment"
##
## $color
## [1] "contr.treatment"
##
## $clarity
## [1] "contr.treatment"
##
## [1] 0.003600305
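The output above shows that treatment contrasts (dummy variable coding) are used for cut, color and clarity. As a small illustration of what treatment contrasts look like, the coding matrix for the five levels of cut can be printed with base R (a sketch; the level names are typed by hand here):
contr.treatment(c("Fair","Good","Very Good","Premium","Ideal")) # the first level, Fair, is the reference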
Q8: Build a model matrix for the covariates ~logcarat+cut+clarity+color+depth+table+xx+yy+zz-1. What is the dimension of this matrix?
Q9: Fit a lasso regression to the diamond data with logprice as the response and the model matrix given in Q8. How did you find the value to be used for the regularization parameter?
Q10: Calculate and report the MSE of the test set (on the scale of logprice).
Answer
library(glmnet)
model = formula(~logcarat+cut+clarity+color+depth+table+xx+yy+zz-1)
mm = model.matrix(model,data=dtrain)
# alternatively model.matrix(~logcarat+cut+clarity+color+depth+table+xx+yy+zz-1,data=dtrain)
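dim(mm) # dimension of the model matrix asked for in Q8: training observations times dummy-coded covariate columns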
set.seed(4268)
fit = cv.glmnet(mm,dtrain$logprice)
plot(fit)
pred = predict(fit, newx=model.matrix(model,data=dtest))
testerror=mean((pred-dtest$logprice)^2)
coef(fit)
testerror
fit$lambda.1se
## 25 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 3.100529e+00
## logcarat 1.560376e+00
## cutFair -5.049215e-02
## cutGood -1.609956e-02
## cutVery Good -4.339380e-03
## cutPremium .
## cutIdeal 1.016087e-02
## claritySI2 7.482682e-02
## claritySI1 1.504417e-01
## clarityVS2 2.134576e-01
## clarityVS1 2.429941e-01
## clarityVVS2 2.980627e-01
## clarityVVS1 3.273970e-01
## clarityIF 3.664982e-01
## colorE -1.409037e-02
## colorF -3.039127e-02
## colorG -5.865112e-02
## colorH -1.038048e-01
## colorI -1.559022e-01
## colorJ -2.140430e-01
## depth .
## table 8.269798e-05
## xx 1.396869e-02
## yy 5.669450e-02
## zz 2.085783e-03
## [1] 0.003839289
## [1] 0.0004755529
The regularization parameter was chosen by 10-fold cross-validation with cv.glmnet; we use the lambda.1se value printed above (about 0.00048), the largest \(\lambda\) with CV error within one standard error of the minimum. The test MSE is 0.0038, slightly larger than the 0.0036 obtained with the best subset model in c. That model did not include depth, table, yy and zz, while the lasso at lambda.1se sets only the coefficients of depth and cutPremium exactly to zero and shrinks, for example, table, xx and zz towards zero.
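If one instead wanted the \(\lambda\) that minimizes the cross-validation error, a small sketch reusing the fit object from cv.glmnet above:
fit$lambda.min # lambda minimizing the CV curve
pred.min = predict(fit, newx=model.matrix(model,data=dtest), s="lambda.min")
mean((pred.min-dtest$logprice)^2) # test MSE at lambda.min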
Q11: A regression tree is built using a greedy approach. What does that mean? Explain the strategy used for constructing a regression tree.
Q12: Is a regression tree a suitable method to handle both numerical and categorical covariates? Elaborate.
Q13: Fit a (full) regression tree to the diamond data with logprice as the response (and the same covariates as for c and d), and plot the result. Comment briefly on your findings.
Q14: Calculate and report the MSE of the test set (on the scale of logprice).
Answer
library(tree)
library(RColorBrewer)
fit = tree(logprice~logcarat+cut+color+clarity+depth+table+xx+yy+zz,data=dtrain)
pred = predict(fit, newdata=dtest)
testerror = mean((pred-dtest$logprice)^2)
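testerror # test MSE on the scale of logprice (Q14)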
plot(fit, type="uniform")
text(fit, pretty=1)
par(mar = c(5.1, 4.1, 1.1, 8.1), xpd = TRUE, pty = "s")
y = cut(dtrain$logprice, breaks = 7)
col.y = brewer.pal(9, "BuGn")
palette(col.y[3:9])
plot(xx ~ yy, col = y, data = dtrain, pch = 20)
partition.tree(fit, add = TRUE)
legend("topright", inset = c(-0.4, 0.2), title = "Logprice", legend = levels(y),
col = col.y[3:9], pch = 20, box.col = NA)
palette("default")
The tree only makes splits on yy, and one split on xx. This model is much simpler than the linear models discussed before, but also less accurate.
Q15: Explain the motivation behind bagging, and how bagging differs from random forest. What is the role of bootstrapping?
Q16: What are the parameter(s) to be set in random forest, and what are the rules for setting these?
Q17: Boosting is a popular method. What is the main difference between random forest and boosting?
Q18: Fit a random forest to the diamond data with logprice as the response (and the same covariates as before). Comment on your choice of parameter (as described in Q16).
Q19: Make a variable importance plot and comment on the plot. Calculate and report the MSE of the test set (on the scale of logprice).
Answer
library(randomForest)
set.seed(4268)
rf=randomForest(logprice~logcarat+cut+color+clarity+depth+table+xx+yy+zz,
data=dtrain,
mtry=3,ntree=1000,importance=TRUE)
yhat.rf=predict(rf,newdata=dtest)
testerror=mean((yhat.rf-dtest$logprice)^2)
varImpPlot(rf,pch=20)
We used mtry = 3, following the rule of thumb mtry \(\approx p/3\) for regression (here \(p=9\) covariates), and ntree = 1000 trees. From the left plot we see that clarity and color are important for predictive power. From the right plot we observe that the variance in the response is well explained by yy, logcarat, xx and zz. The test error is 0.0027789.
Q20: Finally, compare the results from c (subset selection), d (lasso), e (tree) and f (random forest): Which method has given the best performance on the test set and which method has given you the best insight into the relationship between the price and the covariates?
Answer
We will use a data set called nci from the Friedman, Hastie, and Tibshirani (2001) book. The data set consists of measurements of the activity of \(p=6830\) genes for \(n=64\) samples from cancer tumors. Our aim here is to use unsupervised methods to study the relationship among the cancer samples. The data is on a transformed scale, where a negative number means that the gene has a lower regulation than some reference sample and a positive number means a higher regulation. The following names are given for the cancer tumors (R print-out). It is known that K562 cells are leukemia cells and MCF7 are breast cancer cells. Observe the one sample labelled as UNKNOWN
. More information: http://genome-www.stanford.edu/nci60/
Let \({\bf X}\) be a \(n\times p\) matrix of these observations, and \({\bf Z}\) a standardized version, where column (gene) means are 0 and column standard deviations are 1. Further, let \({\bf R}\) be the \(p \times p\) covariance matrix of the gene-standardized observations, which can be estimated as
\[ {\hat {\bf R}}=\frac{1}{n-1}\sum_{i=1}^n ({\bf Z}_i-\bar{\bf Z})({\bf Z}_i-\bar{\bf Z})^T\] where \(\bar{\bf Z}=\frac{1}{n}\sum_{i=1}^n {\bf Z}_i\).
Studying the relationship between the different tumor samples is hard due to the high dimension of the gene data, and we know that the gene expressions of different genes may be correlated. One way to proceed is to calculate principal components, which we find using the eigenvectors of the covariance matrix \({\hat {\bf R}}\).
Q21: What is the definition of a principal component score, and how is the score related to the eigenvectors of the matrix \({\hat {\bf R}}\)?
Study the R-code below. The eigenvectors of \({\hat {\bf R}}\) are found in pca$rotation and the principal component scores in pca$x.
Q22: Explain what is given in the plot with title “First eigenvector”. Why are there only \(n=64\) eigenvectors and \(n=64\) principal component scores?
Q23: How many principal components are needed to explain 80% of the total variance in \({\bf Z}\)? Why is sum(pca$sdev^2)=p?
Hint: Total variance is given as the trace of \({\hat {\bf R}}\); the trace of a square matrix is the sum of its diagonal elements.
Q24: Study the PC1 vs PC2 plot, and comment on the groupings observed. What can you say about the placement of the K562, MCF7 and UNKNOWN samples? Produce the same plot for two other pairs of PCs and comment on your observations.
library(ElemStatLearn)
X=t(nci) #n times p matrix
table(rownames(X))
ngroups=length(table(rownames(X)))
cols=rainbow(ngroups)
cols[c(4,5,7,8,14)] = "black"
pch.vec = rep(4,14)
pch.vec[c(4,5,7,8,14)] = 15:19
colsvsnames=cbind(cols,sort(unique(rownames(X))))
colsamples=cols[match(rownames(X),colsvsnames[,2])]
pchvsnames=cbind(pch.vec,sort(unique(rownames(X))))
pchsamples=pch.vec[match(rownames(X),pchvsnames[,2])]
Z=scale(X)
pca=prcomp(Z)
plot(pca$x[,1],pca$x[,2],xlab="PC1",ylab="PC2",pch=pchsamples,col=colsamples)
legend("bottomright",legend = colsvsnames[,2],cex=0.55,col=cols,pch = pch.vec)
plot(1:dim(X)[2],pca$rotation[,1],type="l",xlab="genes",ylab="weight",main="First eigenvector")
##
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA
## 7 5 7 1 1 6
## MCF7A-repro MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE
## 1 1 8 9 6 2
## RENAL UNKNOWN
## 9 1
Answers:
dim(pca$rotation)
dim(pca$x)
## [1] 6830 64
## [1] 64 64
Q21: The score of observation \(i\) on the \(m\)-th principal component is the projection of the \(i\)-th data point onto the direction of the \(m\)-th principal component, and is given by \(w_{im} = \phi_{1m}z_{i1}+\phi_{2m}z_{i2}+\cdots+\phi_{pm}z_{ip},\) where \(\phi_{1m},\phi_{2m},\cdots,\phi_{pm}\) are the elements of the \(m\)-th eigenvector of \({\hat {\bf R}}.\)
The eigenvectors of \({\hat {\bf R}}\) are used as weights on the original observations to produce the principal component score.
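A quick numerical check of this relationship (using the Z and pca objects defined in the code above): the score matrix should equal the standardized data multiplied by the matrix of eigenvectors.
max(abs(pca$x - Z%*%pca$rotation)) # should be numerically zero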
Q22: The “First eigenvector” plot shows the contribution (weight) of each variable (gene) to the first principal component.
There are \(\min(n-1,p) = 63\) principal components. In R we get \(64\) due to the numerical method that is used. Check that the standard deviation of the \(64\)-th principal component (pca$sdev[64]) is almost \(0\).
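As suggested, a quick check using the pca object above:
pca$sdev[64] # standard deviation of the 64th principal component, essentially zero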
Q23:
var = pca$sdev^2
prop_var = cumsum(var)/sum(var)
prop_var80 = min(which(prop_var>=0.8))
prop_var80
plot(1:length(prop_var), prop_var, type = 'b', xlim = c(0,64),xlab="Principal Component",
ylab="Cumulative Proportion of Variance Explained")
abline(v = prop_var80, lty = 2)
## [1] 32
Each variable (column) in the standardized version of data \({\bf Z}\) has variance \(1\), which means that the total variance contained in data \({\bf Z}\) is \(p \times 1=p.\) The \(64\) PCs capture all the variability contained in the data \({\bf Z}\) which is \(p.\)
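A quick numerical check of both statements (using the pca and Z objects defined above):
sum(pca$sdev^2) # total variance captured by the principal components
ncol(Z) # p, the number of (standardized) genes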
Q24: We observe two groups, one consisting of K562A-repro and K562B-repro, and the other one consisting of MCF7A-repro and MCF7D-repro.
Hierarchical clustering can be used to group the cancer samples, and then we may see which samples are most similar to the five samples named K562, MCF7 and UNKNOWN.
Q25: Explain what it means to use Euclidean distance and average linkage for hierarchical clustering.
Q26: Perform hierarchical clustering with Euclidean distance and average linkage on the scaled gene expression data in Z. Observe where our samples labelled as K562, MCF7 and UNKNOWN are placed in the dendrogram. Which conclusions can you draw from this?
Q27: Study the R-code and plot below. Here we have performed hierarchical clustering based on the first 64 principal components instead of the gene expression data in Z. What is the difference between using all the gene expression data and using the first 64 principal components in the clustering? We have plotted the dendrogram together with a heatmap of the data. Explain what is shown in the heatmap: what is given on the horizontal axis, on the vertical axis and by the values in the pixel grid?
library(pheatmap)
npcs=64
pheatmap(pca$x[,1:npcs],scale="none",cluster_col=FALSE,cluster_row=TRUE,clustering_distance_rows = "euclidean",clustering_method="average",fontsize_row=5,fontsize_col=5)
Answers:
Q25: The Euclidean distance is used as a measure of dissimilarity between two observations, while the average linkage is a measure of dissimilarity between two clusters: we compute all pairwise distances between the observations in cluster A and in cluster B, and record the average of these dissimilarities.
Q26:
samplenames=rownames(Z)
data.dist=dist(Z)
plot(hclust(data.dist, method="average"), labels=samplenames, main="Average Linkage", xlab="", sub="",ylab="",cex=0.4)
Q27: The clustering result using all available principal components is the same as using the original data. This is because the \(64\) principal components explain all the variability of the original data.
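A quick check of this claim (using the objects from the code above): Euclidean distances between samples are unchanged when computed from all 64 principal component scores instead of from Z.
max(abs(dist(pca$x[,1:npcs]) - dist(Z))) # essentially zero, since the rotation preserves Euclidean distances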
pc.dist = dist(pca$x[,1:npcs])
pc.samplenames = rownames(pca$x)
plot(hclust(pc.dist, method="average"), labels=pc.samplenames, main="Average Linkage", xlab="", sub="",ylab="",cex=0.4)
A learning objective in this course is: The student knows, based on an existing data set, how to choose a suitable statistical model, apply sound statistical methods, and perform the analyses using statistical software. The student knows how to present the results from the statistical analyses, and which conclusions can be drawn from the analyses.
This problem addresses that objective.
We will use the classical data set of diabetes from a population of women of Pima Indian heritage in the US, available in the R MASS package. The following information is available for each woman: diabetes (0 = not present, 1 = present), npreg (number of pregnancies), glu (plasma glucose concentration), bp (diastolic blood pressure), skin (triceps skin fold thickness), bmi (body mass index), ped (diabetes pedigree function) and age (in years). We will use a training set (called ctrain) with 300 observations (200 non-diabetes and 100 diabetes cases) and a test set (called ctest) with 232 observations (155 non-diabetes and 77 diabetes cases). Our aim is to make a classification rule for diabetes (or not) based on the available data.
flying=dget("https://www.math.ntnu.no/emner/TMA4268/2019v/data/flying.dd")
ctrain=flying$ctrain
ctest=flying$ctest
Q28: Start by getting to know the training data, by producing summaries and plots. Write a few sentences about what you observe and include your top 3 informative plots and/or outputs.
# not to be included
library(ggplot2)
library(GGally)
library(corrplot)
ctrain.tmp=ctrain
ctrain.tmp$diabetes=as.factor(ctrain.tmp$diabetes)
ctrain.tmp$npreg=as.factor(ctrain.tmp$npreg)
ggpairs(ctrain.tmp,cardinality_threshold = 16)
summary(ctrain)
table(ctrain$npreg)
corrplot(cor(ctrain))
## diabetes npreg glu bp
## Min. :0.0000 Min. : 0.000 Min. : 57.0 Min. : 24.00
## 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.:100.0 1st Qu.: 64.00
## Median :0.0000 Median : 2.000 Median :117.0 Median : 70.00
## Mean :0.3333 Mean : 3.437 Mean :122.7 Mean : 71.23
## 3rd Qu.:1.0000 3rd Qu.: 5.000 3rd Qu.:143.0 3rd Qu.: 78.00
## Max. :1.0000 Max. :17.000 Max. :199.0 Max. :110.00
## skin bmi ped age
## Min. : 7.00 Min. :18.20 Min. :0.0850 Min. :21.00
## 1st Qu.:21.00 1st Qu.:27.88 1st Qu.:0.2532 1st Qu.:23.00
## Median :29.00 Median :32.90 Median :0.4075 Median :28.00
## Mean :29.22 Mean :33.09 Mean :0.4789 Mean :31.48
## 3rd Qu.:37.00 3rd Qu.:37.12 3rd Qu.:0.6525 3rd Qu.:37.25
## Max. :60.00 Max. :67.10 Max. :2.1370 Max. :70.00
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17
## 37 76 43 34 22 16 16 17 10 12 6 5 3 1 1 1
All covariates seem to be somewhat related to diabetes, and from the boxplot, glu
seems to be a strong predictor.
The different covariates have very different scales, and at least for SVM and neural networks it is important to normalize the data (center and scale), and that this is done based on summary statistics from the training data. However, remember that when performing cross-validation we need to do this scaling based only on the training part of each fold.
The only covariate that is not strictly numerical is npreg. There is a high correlation between npreg and age, of 0.65. The other covariates are not very correlated, except for bmi and skin.
We would not expect (?) the effect of npreg on the logit of the probability of diabetes to be linear, so instead of regarding npreg as a numerical variable it might be a possible solution to group npreg into a categorical variable with a few levels (there are now 16 levels of npreg in the training set). This can be done in many ways, and we have, somewhat arbitrarily, set npreg.trunc equal to npreg for npreg below 5 and equal to 5 for all npreg \(\ge 5\). For the methods that are nonlinear in general, we might rather use the scaled original npreg.
We therefore center and scale all covariates, and then add a new categorical npreg.trunc
to be used with methods that give linear class-boundaries in the covariates space. (This did not turn out to be very interesting.)
cor(ctrain$npreg,ctrain$age,method ="spearman") # just to check
mean <- apply(ctrain[,-1], 2, mean) #not diabetes, in column 1
std <- apply(ctrain[,-1], 2, sd)
train.x <- data.frame(scale(ctrain[,-1], center = mean, scale = std))
test.x <- data.frame(scale(ctest[,-1], center = mean, scale = std))
train.x$npreg.trunc <- ctrain$npreg
train.x$npreg.trunc[ctrain$npreg>5]<- 5
test.x$npreg.trunc <- ctest$npreg
test.x$npreg.trunc[ctest$npreg>5]<- 5
train.y<- ctrain[,1] # note: the covariates above were normalized (except npreg.trunc), based on the whole training set
test.y <- ctest[,1]
summary(train.x)
## [1] 0.6383696
## npreg glu bp skin
## Min. :-1.0667 Min. :-2.1986 Min. :-3.8402 Min. :-2.25049
## 1st Qu.:-0.7563 1st Qu.:-0.7606 1st Qu.:-0.5881 1st Qu.:-0.83233
## Median :-0.4459 Median :-0.1921 Median :-0.1003 Median :-0.02195
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.4852 3rd Qu.: 0.6774 3rd Qu.: 0.5501 3rd Qu.: 0.78843
## Max. : 4.2097 Max. : 2.5502 Max. : 3.1518 Max. : 3.11828
## bmi ped age npreg.trunc
## Min. :-2.14934 Min. :-1.2509 Min. :-1.0077 Min. :0.00
## 1st Qu.:-0.75274 1st Qu.:-0.7167 1st Qu.:-0.8154 1st Qu.:1.00
## Median :-0.02738 Median :-0.2268 Median :-0.3348 Median :2.00
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean :2.64
## 3rd Qu.: 0.58251 3rd Qu.: 0.5511 3rd Qu.: 0.5543 3rd Qu.:5.00
## Max. : 4.90944 Max. : 5.2651 Max. : 3.7023 Max. :5.00
Q29: Use different methods to analyse the data. In particular use one method from each of Modules 4, 8, 9 and 11. For each method you should report the training accuracy, the test accuracy and the test AUC, collected in the result matrix below.
resmat=matrix(ncol=3,nrow=8)
# accuracy train, accuracy test, AUC
# for all the methods considered here
colnames(resmat)=c("Accuracy train","Accuracy test","AUC")
rownames(resmat)=c("LDA","logistic regression", "pruned tree","bagged tree","random forest","SVC","SVM","NN h0")
For more details on LDA see Module page 4.
i=1
library(caret)
library(pROC)
library(MASS)
train.xy=data.frame(train.x,diabetes=train.y)
fit=lda(diabetes~npreg+glu+bp+skin+bmi+ped+age, data=train.xy,prior=c(0.5,0.5))
train.res=predict(fit)
test.res = predict(fit, newdata = test.x)
resmat[i,1]=confusionMatrix(train.res$class,factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(test.res$class,factor(test.y))$overall[1]
library(pROC)
roc.lda = roc(test.y,test.res$posterior[,2],legacy.axes=TRUE,auc=TRUE)
resmat[i,3]=auc(roc.lda)
ggroc(roc.lda)+ggtitle("ROC curve")
Model selection for the logistic regression could have been performed with stepAIC. For more details on logistic regression see Module page 4.
i=2
fit=glm(diabetes~npreg+glu+bp+skin+bmi+ped+age, data=train.xy,family="binomial")
summary(fit)
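# NB: the next call omits family="binomial", so it actually fits a Gaussian (linear probability) model, cf. the second summary output below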
fit.logist2=glm(diabetes~factor(npreg.trunc)+glu+bp+skin+bmi+ped+age, data=train.xy)
summary(fit.logist2)
train.res=predict(fit,type="response")
test.res = predict(fit, newdata = test.x,type="response")
train.class=ifelse(train.res>=0.5,1,0)
test.class=ifelse(test.res>=0.5,1,0)
resmat[i,1]=confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(factor(test.class),factor(test.y))$overall[1]
roc.logist = roc(test.y,test.res,legacy.axes=TRUE)
resmat[i,3]=auc(roc.logist)
ggroc(roc.logist)+ggtitle("ROC curve")
##
## Call:
## glm(formula = diabetes ~ npreg + glu + bp + skin + bmi + ped +
## age, family = "binomial", data = train.xy)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8155 -0.6367 -0.3211 0.6147 2.2408
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0391 0.1697 -6.123 9.16e-10 ***
## npreg 0.3387 0.2021 1.676 0.093775 .
## glu 1.0641 0.1762 6.039 1.55e-09 ***
## bp -0.1802 0.1720 -1.048 0.294615
## skin 0.2012 0.2031 0.990 0.321962
## bmi 0.6559 0.2166 3.028 0.002458 **
## ped 0.6083 0.1668 3.648 0.000265 ***
## age 0.3984 0.2106 1.891 0.058594 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 381.91 on 299 degrees of freedom
## Residual deviance: 253.84 on 292 degrees of freedom
## AIC: 269.84
##
## Number of Fisher Scoring iterations: 5
##
##
## Call:
## glm(formula = diabetes ~ factor(npreg.trunc) + glu + bp + skin +
## bmi + ped + age, data = train.xy)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.06771 -0.25678 -0.04716 0.28041 0.86103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2638034 0.0670003 3.937 0.000103 ***
## factor(npreg.trunc)1 0.0139019 0.0786660 0.177 0.859852
## factor(npreg.trunc)2 0.0867121 0.0881124 0.984 0.325889
## factor(npreg.trunc)3 0.1107164 0.0936717 1.182 0.238196
## factor(npreg.trunc)4 -0.0003178 0.1062611 -0.003 0.997616
## factor(npreg.trunc)5 0.1399598 0.0899835 1.555 0.120950
## glu 0.1744862 0.0238551 7.314 2.58e-12 ***
## bp -0.0164161 0.0250178 -0.656 0.512234
## skin 0.0339951 0.0296635 1.146 0.252736
## bmi 0.0872325 0.0302788 2.881 0.004262 **
## ped 0.0818320 0.0227651 3.595 0.000382 ***
## age 0.0645450 0.0310998 2.075 0.038836 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1453613)
##
## Null deviance: 66.667 on 299 degrees of freedom
## Residual deviance: 41.864 on 288 degrees of freedom
## AIC: 286.56
##
## Number of Fisher Scoring iterations: 2
The default in prune.tree is to use the deviance. Since we use cross-validation for the pruning, scaling the data beforehand is not strictly correct; the scaling should have been done as part of the CV.
In the result below the pruned tree was minimal - only one split, and gave a random guessing ROC-curve! So, it would be better to use the full tree, which gave a better ROC (not included here).
For more details on classification trees see Module page 8.
i=3
library(tree)
fit=tree(factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy)
summary(fit)
plot(fit)
text(fit,pretty=0)
cv.fit=cv.tree(fit)
plot(cv.fit$size,cv.fit$dev,type='b')
best=cv.fit$size[which.min(cv.fit$dev)]
prune.fit=prune.tree(fit,best=best)
plot(prune.fit)
text(prune.fit,pretty=0)
train.res=predict(prune.fit)[,2]
test.res=predict(prune.fit,newdata=ctest)[,2]
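# NB: newdata=ctest above is the unscaled test set, while the tree was fitted on the scaled train.xy; test.x would match the training scale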
train.class=ifelse(train.res>=0.5,1,0)
test.class=ifelse(test.res>=0.5,1,0)
resmat[i,1]=confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(factor(test.class),factor(test.y))$overall[1]
roc.tree = roc(test.y,test.res,legacy.axes=TRUE)
resmat[i,3]=auc(roc.tree)
ggroc(roc.tree)+ggtitle("ROC curve")
##
## Classification tree:
## tree(formula = factor(diabetes) ~ npreg + glu + bp + skin + bmi +
## ped + age, data = train.xy)
## Variables actually used in tree construction:
## [1] "glu" "age" "bmi" "ped" "bp" "skin"
## Number of terminal nodes: 20
## Residual mean deviance: 0.4805 = 134.5 / 280
## Misclassification error rate: 0.12 = 36 / 300
For more details on bagged trees see Module page 8.
i=4
library(randomForest)
set.seed(1)
bag=randomForest(factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,mtry=7,importance=TRUE)
bag
train.res=predict(bag,type="prob")[,2]
test.res=predict(bag,newdata=test.x,type="prob")[,2]
train.class=ifelse(train.res>=0.5,1,0)
#train.class2=predict(bag,type="response") #same as train.class
test.class=ifelse(test.res>=0.5,1,0)
resmat[i,1]=confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(factor(test.class),factor(test.y))$overall[1]
roc.bag = roc(test.y,test.res,legacy.axes=TRUE)
resmat[i,3]=auc(roc.bag)
ggroc(roc.bag)+ggtitle("ROC curve")
varImpPlot(bag,pch=20)
##
## Call:
## randomForest(formula = factor(diabetes) ~ npreg + glu + bp + skin + bmi + ped + age, data = train.xy, mtry = 7, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 23.33%
## Confusion matrix:
## 0 1 class.error
## 0 167 33 0.165
## 1 37 63 0.370
For more details on random forest see Module page 8.
i=5
library(randomForest)
set.seed(1)
rf=randomForest(factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,mtry=3,importance=TRUE) #default is 500 trees
rf
train.res=predict(rf,type="prob")[,2]
test.res=predict(rf,newdata=test.x,type="prob")[,2]
train.class=ifelse(train.res>=0.5,1,0)
#train.class2=predict(rf,type="response") #same as train.class
test.class=ifelse(test.res>=0.5,1,0)
resmat[i,1]=confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(factor(test.class),factor(test.y))$overall[1]
roc.rf = roc(test.y,test.res,legacy.axes=TRUE)
resmat[i,3]=auc(roc.rf)
ggroc(roc.rf)+ggtitle("ROC curve")
varImpPlot(rf,pch=20)
##
## Call:
## randomForest(formula = factor(diabetes) ~ npreg + glu + bp + skin + bmi + ped + age, data = train.xy, mtry = 3, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 22.67%
## Confusion matrix:
## 0 1 class.error
## 0 172 28 0.14
## 1 40 60 0.40
We have only covered boosting for regression, not for classification, so therefore boosting is not added here.
The tuning parameters are cost and, for the radial basis function kernel, also gamma, found with 10-fold CV over a grid in tune.
As for the pruning of the tree, running tune below on scaled data is not strictly correct. The scaling should have been part of the cross-validation.
For more details on SVC and SVM see Module page 9.
i=6
library(e1071)
svcfit=svm(factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,scale=FALSE,kernel="linear",cost=1)
summary(svcfit)
# support vector classifier can tune cost function
svcs=tune(svm,factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,scale=FALSE,kernel="linear",ranges=list(cost=c(1e-5,1e-4, 1e-3, 1e-2,1e-1,1,5,10)))
summary(svcs)#$ best is cost 0.01
svcfit=svm(factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,scale=FALSE,kernel="linear",cost=0.01)
train.class=predict(svcfit)
test.class=predict(svcfit,newdata=test.x)
resmat[i,1]=confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(factor(test.class),factor(test.y))$overall[1]
i=7
# support vector machine with radial kernel
set.seed(4268)
svms=tune(svm,factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,scale=FALSE,kernel="linear",ranges=list(cost=c(1e-3, 1e-2,1e-1,1,5,10),gamma=c(0.01,1,5,10)))
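# NB: kernel="linear" is used in this tuning call, so gamma has no effect (note the identical errors across gamma values below); a radial SVM would be tuned with kernel="radial"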
summary(svms)
svms # cost=1 and gamma=0.01
svmfit=svm(factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,scale=FALSE,kernel="radial",cost=1,gamma=0.01)
train.class=predict(svmfit)
test.class=predict(svmfit,newdata=test.x)
resmat[i,1]=confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(factor(test.class),factor(test.y))$overall[1]
##
## Call:
## svm(formula = factor(diabetes) ~ npreg + glu + bp + skin + bmi +
## ped + age, data = train.xy, kernel = "linear", cost = 1,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.1428571
##
## Number of Support Vectors: 144
##
## ( 71 73 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
##
##
##
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.2166667
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-05 0.3333333 0.07200823
## 2 1e-04 0.3333333 0.07200823
## 3 1e-03 0.3333333 0.07200823
## 4 1e-02 0.2166667 0.04513355
## 5 1e-01 0.2300000 0.05762801
## 6 1e+00 0.2300000 0.04830459
## 7 5e+00 0.2300000 0.04830459
## 8 1e+01 0.2300000 0.04830459
##
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 0.01
##
## - best performance: 0.2033333
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-03 0.01 0.3333333 0.05211573
## 2 1e-02 0.01 0.2066667 0.05621827
## 3 1e-01 0.01 0.2133333 0.05018484
## 4 1e+00 0.01 0.2033333 0.05762801
## 5 5e+00 0.01 0.2033333 0.06929985
## 6 1e+01 0.01 0.2033333 0.06929985
## 7 1e-03 1.00 0.3333333 0.05211573
## 8 1e-02 1.00 0.2066667 0.05621827
## 9 1e-01 1.00 0.2133333 0.05018484
## 10 1e+00 1.00 0.2033333 0.05762801
## 11 5e+00 1.00 0.2033333 0.06929985
## 12 1e+01 1.00 0.2033333 0.06929985
## 13 1e-03 5.00 0.3333333 0.05211573
## 14 1e-02 5.00 0.2066667 0.05621827
## 15 1e-01 5.00 0.2133333 0.05018484
## 16 1e+00 5.00 0.2033333 0.05762801
## 17 5e+00 5.00 0.2033333 0.06929985
## 18 1e+01 5.00 0.2033333 0.06929985
## 19 1e-03 10.00 0.3333333 0.05211573
## 20 1e-02 10.00 0.2066667 0.05621827
## 21 1e-01 10.00 0.2133333 0.05018484
## 22 1e+00 10.00 0.2033333 0.05762801
## 23 5e+00 10.00 0.2033333 0.06929985
## 24 1e+01 10.00 0.2033333 0.06929985
##
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 0.01
##
## - best performance: 0.2033333
nnet only allows a sigmoid activation in the hidden layer, and only one hidden layer. For this binary classification problem we fit with the entropy loss (entropy=TRUE). Setting size=0 together with skip=TRUE gives weights directly from the input layer to the output layer; then our setup is the same as logistic regression. In e1071 the tune function can be used for cross-validation also for nnet. For more details on neural nets see Module page 11.
library(nnet)
# first just check out calls
fit<- nnet(diabetes~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,size=5,maxit=5000,linout=FALSE,entropy=TRUE,reltol=1e-14,abstol=1e-14,trace=FALSE)
summary(fit)
fit2<- nnet(diabetes~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,size=5,maxit=5000,linout=FALSE,entropy=TRUE,reltol=1e-14,abstol=1e-14,trace=FALSE)
summary(fit2)
# seems to be many possible local minima with 46 weights and 300 observations
train.res=predict(fit)
train.class=ifelse(train.res>=0.5,1,0)
test.res=predict(fit,newdata=test.x,type="raw")
test.class=ifelse(test.res>=0.5,1,0)
confusionMatrix(factor(train.class),factor(train.y))$overall[1]
confusionMatrix(factor(test.class),factor(test.y))$overall[1]
roc.nnet = roc(test.y,test.res,legacy.axes=TRUE)
auc(roc.nnet)
ggroc(roc.nnet)+ggtitle("ROC curve")
## a 7-5-1 network with 46 weights
## options were - entropy fitting
## b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1
## -957.49 -24.44 1645.56 1320.70 -2996.66 1448.31 934.96 2750.45
## b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2
## 851.81 -1687.26 -3472.30 -2940.60 -3601.44 478.51 -2205.83 -1345.00
## b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3 i6->h3 i7->h3
## -3880.13 -1624.10 -2239.02 790.07 -320.71 -389.17 -1792.22 -7814.79
## b->h4 i1->h4 i2->h4 i3->h4 i4->h4 i5->h4 i6->h4 i7->h4
## 3233.78 -746.15 4029.66 3026.63 5188.73 2458.61 -638.22 -214.68
## b->h5 i1->h5 i2->h5 i3->h5 i4->h5 i5->h5 i6->h5 i7->h5
## 857.68 1051.00 -2063.77 1027.13 -1736.31 -1302.33 -6291.65 -2091.15
## b->o h1->o h2->o h3->o h4->o h5->o
## -6.06 2.21 4.44 -4.28 6.33 -3.10
## a 7-5-1 network with 46 weights
## options were - entropy fitting
## b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1
## -99.34 16.13 -215.79 109.26 -57.03 179.53 -189.83 -345.37
## b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2
## 12.77 224.54 -16.32 69.92 -109.88 -314.51 -86.23 -281.65
## b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3 i6->h3 i7->h3
## -172.74 52.00 -0.63 -2.10 -127.68 250.39 119.53 16.46
## b->h4 i1->h4 i2->h4 i3->h4 i4->h4 i5->h4 i6->h4 i7->h4
## -37.60 -75.04 -260.51 -140.91 8.30 594.31 -123.51 247.80
## b->h5 i1->h5 i2->h5 i3->h5 i4->h5 i5->h5 i6->h5 i7->h5
## -594.51 -377.70 -164.25 -70.91 -544.19 46.73 357.75 -76.61
## b->o h1->o h2->o h3->o h4->o h5->o
## 5.38 -3.39 -5.59 3.95 -5.21 -6.17
## Accuracy
## 0.9
## Accuracy
## 0.7241379
## Area under the curve: 0.774
# ready to choose a good number of hidden nodes with 5-fold CV
grid=c(1:10,15,20,25,30,50)
k <- 5
set.seed(123)
indices <- sample(1:nrow(train.x))
folds <- cut(indices, breaks = k, labels = FALSE)
table(folds) # 60 in each fold
res=matrix(NA,ncol=k,nrow=length(grid)+1) #adding 0 also
for (j in 1:k)
{
thistrain=(1:dim(train.x)[1])[folds!=j]
thisvalid=(1:dim(train.x)[1])[folds==j]
mean <- apply(ctrain[thistrain,], 2, mean)
std <- apply(ctrain[thistrain,], 2, sd)
new <- scale(ctrain, center = mean, scale = std)
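# note: ctrain still contains the response (diabetes) in column 1, so the response is centered and scaled here as well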
for (i in 1:length(grid))
{
thissize=grid[i]
fit=nnet(diabetes~npreg+glu+bp+skin+bmi+ped+age, data=new, size=thissize,subset=thistrain,linout=FALSE,maxit=5000,entropy=TRUE,
reltol=1e-14,abstol=1e-14)
pred=predict(fit,newdata=new[thisvalid,],type="raw")
valid.class=ifelse(pred>=0.5,1,0)
res[i,j]=sum(abs(train.xy[thisvalid,]$diabetes-valid.class))
}
fit=nnet(diabetes~npreg+glu+bp+skin+bmi+ped+age, data=new, size=0,skip=TRUE,subset=thistrain,linout=FALSE,maxit=10000,entropy=TRUE,
reltol=1e-14,abstol=1e-14)
pred=predict(fit,newdata=new[thisvalid,],type="raw")
valid.class=ifelse(pred>=0.5,1,0)
res[i+1,j]=sum(abs(train.xy[thisvalid,]$diabetes-valid.class))
}
misclass=apply(res,1,sum)/nrow(train.x)
misclass
# 0 hidden nodes won! But be aware that there are multiple minima.
## [1] 0.3333333 0.3333333 0.3333333 0.3033333 0.2900000 0.2666667 0.2733333
## [8] 0.2600000 0.2166667 0.3233333 0.2833333 0.2433333 0.2333333 0.2500000
## [15] 0.2300000 0.2133333
i=8
fit<- nnet(diabetes~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,size=0,skip=TRUE,linout=FALSE,entropy=TRUE,maxit=5000,reltol=1e-14,abstol=1e-14)
summary(fit)
train.res=predict(fit)
train.class=ifelse(train.res>=0.5,1,0)
test.res=predict(fit,newdata=test.x)
test.class=ifelse(test.res>=0.5,1,0)
resmat[i,1]<-confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]<-confusionMatrix(factor(test.class),factor(test.y))$overall[1]
roc.nnet = roc(test.y,test.res,legacy.axes=TRUE)
resmat[i,3]<-auc(roc.nnet)
## # weights: 8
## initial value 229.710090
## iter 10 value 127.918371
## final value 126.919747
## converged
## a 7-0-1 network with 8 weights
## options were - skip-layer connections entropy fitting
## b->o i1->o i2->o i3->o i4->o i5->o i6->o i7->o
## -1.04 0.34 1.06 -0.18 0.20 0.66 0.61 0.40
Q30: Conclude with choosing a winning method, and explain why you mean that this is the winning method.
print(resmat)
plot(roc.lda,col=1) #solid
plot(roc.logist,add=TRUE,lty="dashed",col=1)
plot(roc.tree,add=TRUE,lty="solid",col=2)
plot(roc.bag,add=TRUE,lty="dashed",col=2)
plot(roc.rf,add=TRUE,lty="dotted",col=2)
plot(roc.nnet,add=TRUE,lty="solid",col=3)
## Accuracy train Accuracy test AUC
## LDA 0.7733333 0.7887931 0.8490155
## logistic regression 0.7966667 0.7974138 0.8450775
## pruned tree 0.8266667 0.3318966 0.5000000
## bagged tree 0.7666667 0.7629310 0.8277755
## random forest 0.7766667 0.7672414 0.8346460
## SVC 0.8100000 0.7758621 NA
## SVM 0.7966667 0.7758621 NA
## NN h0 0.7966667 0.7974138 0.8450775
The logistic regression might be regarded as the winning method, based on interpretability as well as misclassification rate and AUC.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York: Springer.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer.