(Last changes: 07.05)
You need to install the following packages in R to run the code in this file.
install.packages("knitr") #probably already installed
install.packages("rmarkdown") #probably already installed
install.packages("bestglm")# for subset selection with categorical variables
install.packages("glmnet")# for lasso
install.packages("tree") #tree
install.packages("randomForest") #for random forest
install.packages("ElemStatLearn") #dataset in Problem 2
BiocManager::install(c("pheatmap")) #heatmap in Problem 2
The following are packages that might be useful.
install.packages("ggplot2")
install.packages("GGally") # for ggpairs
install.packages("caret") #for confusion matrices
install.packages("pROC") #for ROC curves
install.packages("e1071") # for support vector machines
install.packages("nnet") # for feed forward neural networks
In this problem we will work with understanding and applying different regression methods, and will use the diamonds data set from the ggplot2 package. The diamonds data set consists of price (in US dollars), which will be our response, and quality information (9 covariates) for around 54000 diamonds. According to Wickham (2016) (Section 3.10) there are four C's of diamond quality: carat (weight), cut (quality of the cut: Fair/Good/Very Good/Premium/Ideal), colour (from worst J to best D) and clarity (from worst to best: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF). In addition there are five physical measurements: depth (total depth percentage, calculated from x, y and z), table (width of the top of the diamond relative to the widest point), xx (length in mm), yy (width in mm) and zz (depth in mm), explained with a drawing in Figure 3.1 of the Wickham (2016) book (ebook available for free for NTNU students). There are no missing data. In addition we have made two transformed variables: logprice, the logarithm (base 10) of the price, and logcarat, the logarithm (base 10) of the carat. To be able to do the analyses without too much hassle (running time for analyses and plotting) we constructed a smaller version of the data set, and divided it into a training set dtrain and a test set dtest, to be loaded from the course web-page as follows.
all=dget("https://www.math.ntnu.no/emner/TMA4268/2019v/data/diamond.dd")
dtrain=all$dtrain
dtest=all$dtest
Our aim will be to predict the price of the diamond, based on some or all of the covariates. We would also like to understand which of the covariates are important in predicting the price and how the covariates are related to the price.
Q1: Would you choose price or logprice as the response variable? Justify your choice. Next, plot your choice of response pairwise with carat, logcarat, color, clarity and cut. Comment.
Answer
par(mfrow=c(1,2))
hist(dtrain$price)
hist(dtrain$logprice)
plot(dtrain$carat,dtrain$logprice)
plot(dtrain$logcarat,dtrain$logprice)
par(mfrow=c(3,1), mar = c(4,3,0,0), mgp=c(2, 1, 0))
plot(logprice~color,data=dtrain)
plot(logprice~clarity,data=dtrain)
plot(logprice~cut,data=dtrain)
par(mfrow=c(1,1))
We choose logprice as the response. The variable price is positive and right-skewed; the log-transform is not skewed and is likely to normalize the variance across observations. It is also convenient that predictions for price are always positive when using the log-transform. Using logprice as the response also means that we have a multiplicative model in the covariates. The relationship between logprice and logcarat is approximately linear. logprice is increasing with the letter of color, while we would expect the opposite. The clarity levels SI2, VVS1 and IF have many outliers in logprice, which makes interpretation hard; still, clarity seems like an important variable to include because of the large differences in logprice. cut might be less important, and it is surprising to see the smallest average for Ideal.
We now turn to the relationship between logprice and carat.
Use the local regression model \(\texttt{logprice} = \beta_0 + \beta_1 \texttt{carat}+ \beta_2 \texttt{carat}^2\) weighted by the tricube kernel \(K_{i0}\).
Q2: What is the predicted price of a diamond weighing 1 carat? Use the closest 20% of the observations.
Q3: What choice of \(\beta_1\), \(\beta_2\) and \(K_{i0}\) would result in KNN-regression?
Answer
plot(dtrain$carat,dtrain$logprice)
fit=loess(dtrain$logprice~dtrain$carat,span=0.2,degree=2)
pred=predict(fit)
lines(dtrain$carat[order(dtrain$carat)],pred[order(dtrain$carat)],col=2,lwd=2)
10^predict(loess(logprice~carat, span=.2, data=dtrain), 1)
## [1] 5100.674
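As a check on the KNN interpretation discussed in the answer to Q3 below, here is a minimal sketch (using the dtrain object loaded above) of the corresponding KNN-style estimate: the plain mean of logprice over the 20% of training observations with carat closest to 1, back-transformed to USD.
k = ceiling(0.2*nrow(dtrain)) # number of neighbours: 20% of the training observations
idx = order(abs(dtrain$carat-1))[1:k] # the k observations with carat closest to 1
10^mean(dtrain$logprice[idx]) # constant-kernel (KNN) estimate of the price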
The predicted price of a diamond weighing 1 carat is about 5101 USD, found by back-transforming (10^) the loess prediction of logprice at carat equal to 1. KNN regression is obtained by setting \(\beta_1=\beta_2=0\) and letting \(K_{i0}\) put a constant weight on the 20% of the observations with carat closest to 1, and zero weight on the rest. The \(\beta_0\) is then estimated by minimizing \(\sum_{i=1}^n K_{i0}(y_i-\beta_0)^2\), which will give \(\hat{\beta}_0\) to be the mean of the response of the 20% closest observations.
Q4: Describe how you can perform model selection in regression with AIC as criterion.
Q5: What are the main differences between using AIC for model selection and using cross-validation (with mean squared test error MSE)?
Q6: See the code below for performing model selection with bestglm() based on AIC. What kind of contrast is used to represent cut, color and clarity? Write down the final best model and explain what you can interpret from the model.
Q7: Calculate and report the MSE of the test set using the best model (on the scale of logprice).
library(bestglm)
ds=as.data.frame(within(dtrain,{
y=logprice # setting the response
logprice=NULL # do not include as covariate
price=NULL # do not include as covariate
carat=NULL # do not include as covariate
}))
fit=bestglm(Xy=ds, IC="AIC")$BestModel
summary(fit)
##
## Call:
## lm(formula = y ~ ., data = data.frame(Xy[, c(bestset[-1], FALSE),
## drop = FALSE], y = y))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31283 -0.03717 0.00033 0.03564 0.58991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.985969 0.042993 69.452 < 2e-16 ***
## logcarat 1.580675 0.029042 54.428 < 2e-16 ***
## cutGood 0.028558 0.005747 4.970 6.93e-07 ***
## cutVery Good 0.040480 0.005285 7.660 2.22e-14 ***
## cutPremium 0.042984 0.005295 8.118 5.90e-16 ***
## cutIdeal 0.054829 0.005223 10.498 < 2e-16 ***
## colorE -0.021787 0.003031 -7.189 7.51e-13 ***
## colorF -0.038892 0.003107 -12.519 < 2e-16 ***
## colorG -0.067999 0.003024 -22.488 < 2e-16 ***
## colorH -0.112815 0.003235 -34.877 < 2e-16 ***
## colorI -0.165072 0.003686 -44.783 < 2e-16 ***
## colorJ -0.223651 0.004543 -49.233 < 2e-16 ***
## claritySI2 0.174988 0.007933 22.060 < 2e-16 ***
## claritySI1 0.251631 0.007876 31.950 < 2e-16 ***
## clarityVS2 0.315556 0.007924 39.824 < 2e-16 ***
## clarityVS1 0.346409 0.008051 43.027 < 2e-16 ***
## clarityVVS2 0.402528 0.008233 48.894 < 2e-16 ***
## clarityVVS1 0.433326 0.008549 50.684 < 2e-16 ***
## clarityIF 0.473681 0.009154 51.745 < 2e-16 ***
## xx 0.069331 0.006610 10.489 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05978 on 4980 degrees of freedom
## Multiple R-squared: 0.9815, Adjusted R-squared: 0.9814
## F-statistic: 1.39e+04 on 19 and 4980 DF, p-value: < 2.2e-16
Answer
attr(model.matrix(fit),"contrasts")
pred=predict(fit,newdata=dtest)
testerror=mean((pred-dtest$logprice)^2)
testerror
## $cut
## [1] "contr.treatment"
##
## $color
## [1] "contr.treatment"
##
## $clarity
## [1] "contr.treatment"
##
## [1] 0.003600305
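The output above shows that treatment contrasts (dummy variable coding) are used for cut, color and clarity. As a small illustration of what treatment contrasts look like, the coding matrix for the five levels of cut can be printed with base R (a sketch; the level names are typed by hand here):
contr.treatment(c("Fair","Good","Very Good","Premium","Ideal")) # the first level, Fair, is the reference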
Q8: Build a model matrix for the covariates ~logcarat+cut+clarity+color+depth+table+xx+yy+zz-1. What is the dimension of this matrix?
Q9: Fit a lasso regression to the diamond data with logprice as the response and the model matrix given in Q8. How did you find the value to be used for the regularization parameter?
Q10: Calculate and report the MSE of the test set (on the scale of logprice).
Answer
library(glmnet)
model = formula(~logcarat+cut+clarity+color+depth+table+xx+yy+zz-1)
mm = model.matrix(model,data=dtrain)
# alternatively model.matrix(~logcarat+cut+clarity+color+depth+table+xx+yy+zz-1,data=dtrain)
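dim(mm) # dimension of the model matrix asked for in Q8: training observations times dummy-coded covariate columns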
set.seed(4268)
fit = cv.glmnet(mm,dtrain$logprice)
plot(fit)
pred = predict(fit, newx=model.matrix(model,data=dtest))
testerror=mean((pred-dtest$logprice)^2)
coef(fit)
testerror
fit$lambda.1se
## 25 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 3.100529e+00
## logcarat 1.560376e+00
## cutFair -5.049215e-02
## cutGood -1.609956e-02
## cutVery Good -4.339380e-03
## cutPremium .
## cutIdeal 1.016087e-02
## claritySI2 7.482682e-02
## claritySI1 1.504417e-01
## clarityVS2 2.134576e-01
## clarityVS1 2.429941e-01
## clarityVVS2 2.980627e-01
## clarityVVS1 3.273970e-01
## clarityIF 3.664982e-01
## colorE -1.409037e-02
## colorF -3.039127e-02
## colorG -5.865112e-02
## colorH -1.038048e-01
## colorI -1.559022e-01
## colorJ -2.140430e-01
## depth .
## table 8.269798e-05
## xx 1.396869e-02
## yy 5.669450e-02
## zz 2.085783e-03
## [1] 0.003839289
## [1] 0.0004755529
The regularization parameter was chosen by 10-fold cross-validation with cv.glmnet; we use the lambda.1se value printed above (about 0.00048), the largest \(\lambda\) with CV error within one standard error of the minimum. The test MSE is 0.0038, slightly larger than the 0.0036 obtained with the best subset model in c. That model did not include depth, table, yy and zz, while the lasso at lambda.1se sets only the coefficients of depth and cutPremium exactly to zero and shrinks, for example, table, xx and zz towards zero.
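If one instead wanted the \(\lambda\) that minimizes the cross-validation error, a small sketch reusing the fit object from cv.glmnet above:
fit$lambda.min # lambda minimizing the CV curve
pred.min = predict(fit, newx=model.matrix(model,data=dtest), s="lambda.min")
mean((pred.min-dtest$logprice)^2) # test MSE at lambda.min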
Q11: A regression tree is built using a greedy approach. What does that mean? Explain the strategy used for constructing a regression tree.
Q12: Is a regression tree a suitable method to handle both numerical and categorical covariates? Elaborate.
Q13: Fit a (full) regression tree to the diamond data with logprice as the response (and the same covariates as for c and d), and plot the result. Comment briefly on your findings.
Q14: Calculate and report the MSE of the test set (on the scale of logprice).
Answer
library(tree)
library(RColorBrewer)
fit = tree(logprice~logcarat+cut+color+clarity+depth+table+xx+yy+zz,data=dtrain)
pred = predict(fit, newdata=dtest)
testerror = mean((pred-dtest$logprice)^2)
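testerror # test MSE on the scale of logprice (Q14)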
plot(fit, type="uniform")
text(fit, pretty=1)
par(mar = c(5.1, 4.1, 1.1, 8.1), xpd = TRUE, pty = "s")
y = cut(dtrain$logprice, breaks = 7)
col.y = brewer.pal(9, "BuGn")
palette(col.y[3:9])
plot(xx ~ yy, col = y, data = dtrain, pch = 20)
partition.tree(fit, add = TRUE)
legend("topright", inset = c(-0.4, 0.2), title = "Logprice", legend = levels(y),
col = col.y[3:9], pch = 20, box.col = NA)
palette("default")
The tree only makes splits on yy, and one split on xx. This model is much simpler than the linear models discussed before, but also less accurate.
Q15: Explain the motivation behind bagging, and how bagging differs from random forest. What is the role of bootstrapping?
Q16: What are the parameter(s) to be set in random forest, and what are the rules for setting these?
Q17: Boosting is a popular method. What is the main difference between random forest and boosting?
Q18: Fit a random forest to the diamond data with logprice as the response (and the same covariates as before). Comment on your choice of parameter (as described in Q16).
Q19: Make a variable importance plot and comment on the plot. Calculate and report the MSE of the test set (on the scale of logprice).
Answer
library(randomForest)
set.seed(4268)
rf=randomForest(logprice~logcarat+cut+color+clarity+depth+table+xx+yy+zz,
data=dtrain,
mtry=3,ntree=1000,importance=TRUE)
yhat.rf=predict(rf,newdata=dtest)
testerror=mean((yhat.rf-dtest$logprice)^2)
varImpPlot(rf,pch=20)
We used mtry = 3, following the rule of thumb mtry \(\approx p/3\) for regression (here \(p=9\) covariates), and ntree = 1000 trees. From the left plot we see that clarity and color are important for predictive power. From the right plot we observe that the variance in the response is well explained by yy, logcarat, xx and zz. The test error is 0.0027789.
Q20: Finally, compare the results from c (subset selection), d (lasso), e (tree) and f (random forest): Which method has given the best performance on the test set and which method has given you the best insight into the relationship between the price and the covariates?
Answer
We will use a data set called nci from the Friedman, Hastie, and Tibshirani (2001) book. The data set consists of measurements of the activity of \(p=6830\) genes for \(n=64\) samples from cancer tumors. Our aim here is to use unsupervised methods to study the relationship among the cancer samples. The data is on a transformed scale, where a negative number means that the gene has a lower regulation than some reference sample and a positive number means a higher regulation. The following names are given for the cancer tumors (R print-out). It is known that K562 cells are leukemia cells and MCF7 are breast cancer cells. Observe the one sample labelled as UNKNOWN
. More information: http://genome-www.stanford.edu/nci60/
Let \({\bf X}\) be a \(n\times p\) matrix of these observations, and \({\bf Z}\) a standardized version, where column (gene) means are 0 and column standard deviations are 1. Further, let \({\bf R}\) be the \(p \times p\) covariance matrix of the gene-standardized observations, which can be estimated as
\[ {\hat {\bf R}}=\frac{1}{n-1}\sum_{i=1}^n ({\bf Z}_i-\bar{\bf Z})({\bf Z}_i-\bar{\bf Z})^T\] where \(\bar{\bf Z}=\frac{1}{n}\sum_{i=1}^n {\bf Z}_i\).
Studying the relationship between the different tumor samples is hard due to the high dimension of the gene data, and we know that the gene expressions of different genes may be correlated. One way to proceed is to calculate principal components, which we find using the eigenvectors of the covariance matrix \({\hat {\bf R}}\).
Q21: What is the definition of a principal component score, and how is the score related to the eigenvectors of the matrix \({\hat {\bf R}}\)?
Study the R-code below. The eigenvectors of \({\hat {\bf R}}\) are found in pca$rotation and the principal component scores in pca$x.
Q22: Explain what is given in the plot with title “First eigenvector”. Why are there only \(n=64\) eigenvectors and \(n=64\) principal component scores?
Q23: How many principal components are needed to explain 80% of the total variance in \({\bf Z}\)? Why is sum(pca$sdev^2)=p?
Hint: Total variance is given as the trace of \({\hat {\bf R}}\); the trace of a square matrix is the sum of its diagonal elements.
Q24: Study the PC1 vs PC2 plot, and comment on the groupings observed. What can you say about the placement of the K562, MCF7 and UNKNOWN samples? Produce the same plot for two other pairs of PCs and comment on your observations.
library(ElemStatLearn)
X=t(nci) #n times p matrix
table(rownames(X))
ngroups=length(table(rownames(X)))
cols=rainbow(ngroups)
cols[c(4,5,7,8,14)] = "black"
pch.vec = rep(4,14)
pch.vec[c(4,5,7,8,14)] = 15:19
colsvsnames=cbind(cols,sort(unique(rownames(X))))
colsamples=cols[match(rownames(X),colsvsnames[,2])]
pchvsnames=cbind(pch.vec,sort(unique(rownames(X))))
pchsamples=pch.vec[match(rownames(X),pchvsnames[,2])]
Z=scale(X)
pca=prcomp(Z)
plot(pca$x[,1],pca$x[,2],xlab="PC1",ylab="PC2",pch=pchsamples,col=colsamples)
legend("bottomright",legend = colsvsnames[,2],cex=0.55,col=cols,pch = pch.vec)
plot(1:dim(X)[2],pca$rotation[,1],type="l",xlab="genes",ylab="weight",main="First eigenvector")
##
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA
## 7 5 7 1 1 6
## MCF7A-repro MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE
## 1 1 8 9 6 2
## RENAL UNKNOWN
## 9 1
Answers:
dim(pca$rotation)
dim(pca$x)
## [1] 6830 64
## [1] 64 64
Q21: The score of observation \(i\) on the \(m\)-th principal component is the projection of the \(i\)-th data point onto the direction of the \(m\)-th principal component, and is given by \(w_{im} = \phi_{1m}z_{i1}+\phi_{2m}z_{i2}+\cdots+\phi_{pm}z_{ip},\) where \(\phi_{1m},\phi_{2m},\cdots,\phi_{pm}\) are the elements of the \(m\)-th eigenvector of \({\hat {\bf R}}.\)
The eigenvectors of \({\hat {\bf R}}\) are used as weights on the original observations to produce the principal component score.
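A quick numerical check of this relationship (using the Z and pca objects defined in the code above): the score matrix should equal the standardized data multiplied by the matrix of eigenvectors.
max(abs(pca$x - Z%*%pca$rotation)) # should be numerically zero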
Q22: The “First eigenvector” plot shows the contribution (weight) of each variable (gene) to the first principal component.
There are \(\min(n-1,p) = 63\) principal components. In R we get \(64\) due to the numerical method that is used. Check that the standard deviation of the \(64\)-th principal component (pca$sdev[64]) is almost \(0\).
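As suggested, a quick check using the pca object above:
pca$sdev[64] # standard deviation of the 64th principal component, essentially zero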
Q23:
var = pca$sdev^2
prop_var = cumsum(var)/sum(var)
prop_var80 = min(which(prop_var>=0.8))
prop_var80
plot(1:length(prop_var), prop_var, type = 'b', xlim = c(0,64),xlab="Principal Component",
ylab="Cumulative Proportion of Variance Explained")
abline(v = prop_var80, lty = 2)
## [1] 32
Each variable (column) in the standardized version of data \({\bf Z}\) has variance \(1\), which means that the total variance contained in data \({\bf Z}\) is \(p \times 1=p.\) The \(64\) PCs capture all the variability contained in the data \({\bf Z}\) which is \(p.\)
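A quick numerical check of both statements (using the pca and Z objects defined above):
sum(pca$sdev^2) # total variance captured by the principal components
ncol(Z) # p, the number of (standardized) genes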
Q24: We observe two groups, one consisting of K562A-repro and K562B-repro, and the other one consisting of MCF7A-repro and MCF7D-repro.
Hierarchical clustering can be used to group the cancer samples, and then we may see which samples are most similar to the five samples named K562, MCF7 and UNKNOWN.
Q25: Explain what it means to use Euclidean distance and average linkage for hierarchical clustering.
Q26: Perform hierarchical clustering with Euclidean distance and average linkage on the scaled gene expression data in Z. Observe where our samples labelled as K562, MCF7 and UNKNOWN are placed in the dendrogram. Which conclusions can you draw from this?
Q27: Study the R-code and plot below. Here we have performed hierarchical clustering based on the first 64 principal components instead of the gene expression data in Z. What is the difference between using all the gene expression data and using the first 64 principal components in the clustering? We have plotted the dendrogram together with a heatmap of the data. Explain what is shown in the heatmap: what is given on the horizontal axis, on the vertical axis and by the values in the pixel grid?
library(pheatmap)
npcs=64
pheatmap(pca$x[,1:npcs],scale="none",cluster_col=FALSE,cluster_row=TRUE,clustering_distance_rows = "euclidean",clustering_method="average",fontsize_row=5,fontsize_col=5)
Answers:
Q25: The Euclidean distance is used as a measure of dissimilarity between two observations, while the average linkage is a measure of dissimilarity between two clusters: we compute all pairwise distances between the observations in cluster A and in cluster B, and record the average of these dissimilarities.
Q26:
samplenames=rownames(Z)
data.dist=dist(Z)
plot(hclust(data.dist, method="average"), labels=samplenames, main="Average Linkage", xlab="", sub="",ylab="",cex=0.4)
Q27: The clustering result using all available principal components is the same as using the original data. This is because the \(64\) principal components explain all the variability of the original data.
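A quick check of this claim (using the objects from the code above): Euclidean distances between samples are unchanged when computed from all 64 principal component scores instead of from Z.
max(abs(dist(pca$x[,1:npcs]) - dist(Z))) # essentially zero, since the rotation preserves Euclidean distances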
pc.dist = dist(pca$x[,1:npcs])
pc.samplenames = rownames(pca$x)
plot(hclust(pc.dist, method="average"), labels=pc.samplenames, main="Average Linkage", xlab="", sub="",ylab="",cex=0.4)
A learning objective in this course is: The student knows, based on an existing data set, how to choose a suitable statistical model, apply sound statistical methods, and perform the analyses using statistical software. The student knows how to present the results from the statistical analyses, and which conclusions can be drawn from the analyses.
This problem addresses that objective.
We will use the classical data set of diabetes from a population of women of Pima Indian heritage in the US, available in the R MASS package. The following information is available for each woman: diabetes (0 = not present, 1 = present), npreg (number of pregnancies), glu (plasma glucose concentration), bp (diastolic blood pressure), skin (triceps skin fold thickness), bmi (body mass index), ped (diabetes pedigree function) and age (in years). We will use a training set (called ctrain) with 300 observations (200 non-diabetes and 100 diabetes cases) and a test set (called ctest) with 232 observations (155 non-diabetes and 77 diabetes cases). Our aim is to make a classification rule for diabetes (or not) based on the available data.
flying=dget("https://www.math.ntnu.no/emner/TMA4268/2019v/data/flying.dd")
ctrain=flying$ctrain
ctest=flying$ctest
Q28: Start by getting to know the training data, by producing summaries and plots. Write a few sentences about what you observe and include your top 3 informative plots and/or outputs.
# not to be included
library(ggplot2)
library(GGally)
library(corrplot)
ctrain.tmp=ctrain
ctrain.tmp$diabetes=as.factor(ctrain.tmp$diabetes)
ctrain.tmp$npreg=as.factor(ctrain.tmp$npreg)
ggpairs(ctrain.tmp,cardinality_threshold = 16)
summary(ctrain)
table(ctrain$npreg)
corrplot(cor(ctrain))
## diabetes npreg glu bp
## Min. :0.0000 Min. : 0.000 Min. : 57.0 Min. : 24.00
## 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.:100.0 1st Qu.: 64.00
## Median :0.0000 Median : 2.000 Median :117.0 Median : 70.00
## Mean :0.3333 Mean : 3.437 Mean :122.7 Mean : 71.23
## 3rd Qu.:1.0000 3rd Qu.: 5.000 3rd Qu.:143.0 3rd Qu.: 78.00
## Max. :1.0000 Max. :17.000 Max. :199.0 Max. :110.00
## skin bmi ped age
## Min. : 7.00 Min. :18.20 Min. :0.0850 Min. :21.00
## 1st Qu.:21.00 1st Qu.:27.88 1st Qu.:0.2532 1st Qu.:23.00
## Median :29.00 Median :32.90 Median :0.4075 Median :28.00
## Mean :29.22 Mean :33.09 Mean :0.4789 Mean :31.48
## 3rd Qu.:37.00 3rd Qu.:37.12 3rd Qu.:0.6525 3rd Qu.:37.25
## Max. :60.00 Max. :67.10 Max. :2.1370 Max. :70.00
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 15 17
## 37 76 43 34 22 16 16 17 10 12 6 5 3 1 1 1
All covariates seem to be somewhat related to diabetes, and from the boxplot, glu
seems to be a strong predictor.
The different covariates have very different scales, and at least for SVM and neural networks it is important to normalize the data (center and scale), and that this is done based on summary statistics from the training data. However, remember that when performing cross-validation we need to do this scaling based only on the training part of each fold.
The only covariate that is not strictly numerical is npreg. There is a high correlation between npreg and age, of 0.65. The other covariates are not very correlated, except for bmi and skin.
We would not expect (?) the effect of npreg on the logit of the probability of diabetes to be linear, so instead of regarding npreg as a numerical variable it might be a possible solution to group npreg into a categorical variable with a few levels (there are now 16 levels of npreg in the training set). This can be done in many ways, and we have, somewhat arbitrarily, set npreg.trunc equal to npreg for npreg below 5 and equal to 5 for all npreg \(\ge 5\). For the methods that are nonlinear in general, we might rather use the scaled original npreg.
We therefore center and scale all covariates, and then add a new categorical npreg.trunc
to be used with methods that give linear class-boundaries in the covariates space. (This did not turn out to be very interesting.)
cor(ctrain$npreg,ctrain$age,method ="spearman") # just to check
mean <- apply(ctrain[,-1], 2, mean) #not diabetes, in column 1
std <- apply(ctrain[,-1], 2, sd)
train.x <- data.frame(scale(ctrain[,-1], center = mean, scale = std))
test.x <- data.frame(scale(ctest[,-1], center = mean, scale = std))
train.x$npreg.trunc <- ctrain$npreg
train.x$npreg.trunc[ctrain$npreg>5]<- 5
test.x$npreg.trunc <- ctest$npreg
test.x$npreg.trunc[ctest$npreg>5]<- 5
train.y<- ctrain[,1] # note: the covariates above were normalized (except npreg.trunc), based on the whole training set
test.y <- ctest[,1]
summary(train.x)
## [1] 0.6383696
## npreg glu bp skin
## Min. :-1.0667 Min. :-2.1986 Min. :-3.8402 Min. :-2.25049
## 1st Qu.:-0.7563 1st Qu.:-0.7606 1st Qu.:-0.5881 1st Qu.:-0.83233
## Median :-0.4459 Median :-0.1921 Median :-0.1003 Median :-0.02195
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.4852 3rd Qu.: 0.6774 3rd Qu.: 0.5501 3rd Qu.: 0.78843
## Max. : 4.2097 Max. : 2.5502 Max. : 3.1518 Max. : 3.11828
## bmi ped age npreg.trunc
## Min. :-2.14934 Min. :-1.2509 Min. :-1.0077 Min. :0.00
## 1st Qu.:-0.75274 1st Qu.:-0.7167 1st Qu.:-0.8154 1st Qu.:1.00
## Median :-0.02738 Median :-0.2268 Median :-0.3348 Median :2.00
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean :2.64
## 3rd Qu.: 0.58251 3rd Qu.: 0.5511 3rd Qu.: 0.5543 3rd Qu.:5.00
## Max. : 4.90944 Max. : 5.2651 Max. : 3.7023 Max. :5.00
Q29: Use different methods to analyse the data. In particular use one method from each of Modules 4, 8, 9 and 11. For each method you should report the training accuracy, the test accuracy and the test AUC, collected in the result matrix below.
resmat=matrix(ncol=3,nrow=8)
# accuracy train, accuracy test, AUC
# for all the methods considered here
colnames(resmat)=c("Accuracy train","Accuracy test","AUC")
rownames(resmat)=c("LDA","logistic regression", "pruned tree","bagged tree","random forest","SVC","SVM","NN h0")
For more details on LDA see Module page 4.
i=1
library(caret)
library(pROC)
library(MASS)
train.xy=data.frame(train.x,diabetes=train.y)
fit=lda(diabetes~npreg+glu+bp+skin+bmi+ped+age, data=train.xy,prior=c(0.5,0.5))
train.res=predict(fit)
test.res = predict(fit, newdata = test.x)
resmat[i,1]=confusionMatrix(train.res$class,factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(test.res$class,factor(test.y))$overall[1]
library(pROC)
roc.lda = roc(test.y,test.res$posterior[,2],legacy.axes=TRUE,auc=TRUE)
resmat[i,3]=auc(roc.lda)
ggroc(roc.lda)+ggtitle("ROC curve")
Model selection for the logistic regression could have been performed with stepAIC. For more details on logistic regression see Module page 4.
i=2
fit=glm(diabetes~npreg+glu+bp+skin+bmi+ped+age, data=train.xy,family="binomial")
summary(fit)
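# NB: the next call omits family="binomial", so it actually fits a Gaussian (linear probability) model, cf. the second summary output below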
fit.logist2=glm(diabetes~factor(npreg.trunc)+glu+bp+skin+bmi+ped+age, data=train.xy)
summary(fit.logist2)
train.res=predict(fit,type="response")
test.res = predict(fit, newdata = test.x,type="response")
train.class=ifelse(train.res>=0.5,1,0)
test.class=ifelse(test.res>=0.5,1,0)
resmat[i,1]=confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(factor(test.class),factor(test.y))$overall[1]
roc.logist = roc(test.y,test.res,legacy.axes=TRUE)
resmat[i,3]=auc(roc.logist)
ggroc(roc.logist)+ggtitle("ROC curve")
##
## Call:
## glm(formula = diabetes ~ npreg + glu + bp + skin + bmi + ped +
## age, family = "binomial", data = train.xy)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8155 -0.6367 -0.3211 0.6147 2.2408
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.0391 0.1697 -6.123 9.16e-10 ***
## npreg 0.3387 0.2021 1.676 0.093775 .
## glu 1.0641 0.1762 6.039 1.55e-09 ***
## bp -0.1802 0.1720 -1.048 0.294615
## skin 0.2012 0.2031 0.990 0.321962
## bmi 0.6559 0.2166 3.028 0.002458 **
## ped 0.6083 0.1668 3.648 0.000265 ***
## age 0.3984 0.2106 1.891 0.058594 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 381.91 on 299 degrees of freedom
## Residual deviance: 253.84 on 292 degrees of freedom
## AIC: 269.84
##
## Number of Fisher Scoring iterations: 5
##
##
## Call:
## glm(formula = diabetes ~ factor(npreg.trunc) + glu + bp + skin +
## bmi + ped + age, data = train.xy)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.06771 -0.25678 -0.04716 0.28041 0.86103
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2638034 0.0670003 3.937 0.000103 ***
## factor(npreg.trunc)1 0.0139019 0.0786660 0.177 0.859852
## factor(npreg.trunc)2 0.0867121 0.0881124 0.984 0.325889
## factor(npreg.trunc)3 0.1107164 0.0936717 1.182 0.238196
## factor(npreg.trunc)4 -0.0003178 0.1062611 -0.003 0.997616
## factor(npreg.trunc)5 0.1399598 0.0899835 1.555 0.120950
## glu 0.1744862 0.0238551 7.314 2.58e-12 ***
## bp -0.0164161 0.0250178 -0.656 0.512234
## skin 0.0339951 0.0296635 1.146 0.252736
## bmi 0.0872325 0.0302788 2.881 0.004262 **
## ped 0.0818320 0.0227651 3.595 0.000382 ***
## age 0.0645450 0.0310998 2.075 0.038836 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1453613)
##
## Null deviance: 66.667 on 299 degrees of freedom
## Residual deviance: 41.864 on 288 degrees of freedom
## AIC: 286.56
##
## Number of Fisher Scoring iterations: 2
The default in prune.tree is to use the deviance. Since we use cross-validation for the pruning, scaling the data beforehand is not strictly correct; the scaling should have been done as part of the CV.
In the result below the pruned tree was minimal - only one split, and gave a random guessing ROC-curve! So, it would be better to use the full tree, which gave a better ROC (not included here).
For more details on classification trees see Module page 8.
i=3
library(tree)
fit=tree(factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy)
summary(fit)
plot(fit)
text(fit,pretty=0)
cv.fit=cv.tree(fit)
plot(cv.fit$size,cv.fit$dev,type='b')
best=cv.fit$size[which.min(cv.fit$dev)]
prune.fit=prune.tree(fit,best=best)
plot(prune.fit)
text(prune.fit,pretty=0)
train.res=predict(prune.fit)[,2]
test.res=predict(prune.fit,newdata=ctest)[,2]
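# NB: newdata=ctest above is the unscaled test set, while the tree was fitted on the scaled train.xy; test.x would match the training scale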
train.class=ifelse(train.res>=0.5,1,0)
test.class=ifelse(test.res>=0.5,1,0)
resmat[i,1]=confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(factor(test.class),factor(test.y))$overall[1]
roc.tree = roc(test.y,test.res,legacy.axes=TRUE)
resmat[i,3]=auc(roc.tree)
ggroc(roc.tree)+ggtitle("ROC curve")
##
## Classification tree:
## tree(formula = factor(diabetes) ~ npreg + glu + bp + skin + bmi +
## ped + age, data = train.xy)
## Variables actually used in tree construction:
## [1] "glu" "age" "bmi" "ped" "bp" "skin"
## Number of terminal nodes: 20
## Residual mean deviance: 0.4805 = 134.5 / 280
## Misclassification error rate: 0.12 = 36 / 300
For more details on bagged trees see Module page 8.
i=4
library(randomForest)
set.seed(1)
bag=randomForest(factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,mtry=7,importance=TRUE)
bag
train.res=predict(bag,type="prob")[,2]
test.res=predict(bag,newdata=test.x,type="prob")[,2]
train.class=ifelse(train.res>=0.5,1,0)
#train.class2=predict(bag,type="response") #same as train.class
test.class=ifelse(test.res>=0.5,1,0)
resmat[i,1]=confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(factor(test.class),factor(test.y))$overall[1]
roc.bag = roc(test.y,test.res,legacy.axes=TRUE)
resmat[i,3]=auc(roc.bag)
ggroc(roc.bag)+ggtitle("ROC curve")
varImpPlot(bag,pch=20)
##
## Call:
## randomForest(formula = factor(diabetes) ~ npreg + glu + bp + skin + bmi + ped + age, data = train.xy, mtry = 7, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 23.33%
## Confusion matrix:
## 0 1 class.error
## 0 167 33 0.165
## 1 37 63 0.370
For more details on random forest see Module page 8.
i=5
library(randomForest)
set.seed(1)
rf=randomForest(factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,mtry=3,importance=TRUE) #default is 500 trees
rf
train.res=predict(rf,type="prob")[,2]
test.res=predict(rf,newdata=test.x,type="prob")[,2]
train.class=ifelse(train.res>=0.5,1,0)
#train.class2=predict(rf,type="response") #same as train.class
test.class=ifelse(test.res>=0.5,1,0)
resmat[i,1]=confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(factor(test.class),factor(test.y))$overall[1]
roc.rf = roc(test.y,test.res,legacy.axes=TRUE)
resmat[i,3]=auc(roc.rf)
ggroc(roc.rf)+ggtitle("ROC curve")
varImpPlot(rf,pch=20)
##
## Call:
## randomForest(formula = factor(diabetes) ~ npreg + glu + bp + skin + bmi + ped + age, data = train.xy, mtry = 3, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 22.67%
## Confusion matrix:
## 0 1 class.error
## 0 172 28 0.14
## 1 40 60 0.40
We have only covered boosting for regression, not for classification, so therefore boosting is not added here.
The tuning parameters are cost and, for the radial basis function kernel, also gamma, found with 10-fold CV over a grid in tune.
As for the pruning of the tree, running tune below on scaled data is not strictly correct. The scaling should have been part of the cross-validation.
For more details on SVC and SVM see Module page 9.
i=6
library(e1071)
svcfit=svm(factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,scale=FALSE,kernel="linear",cost=1)
summary(svcfit)
# support vector classifier can tune cost function
svcs=tune(svm,factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,scale=FALSE,kernel="linear",ranges=list(cost=c(1e-5,1e-4, 1e-3, 1e-2,1e-1,1,5,10)))
summary(svcs)#$ best is cost 0.01
svcfit=svm(factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,scale=FALSE,kernel="linear",cost=0.01)
train.class=predict(svcfit)
test.class=predict(svcfit,newdata=test.x)
resmat[i,1]=confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(factor(test.class),factor(test.y))$overall[1]
i=7
# support vector machine with radial kernel
set.seed(4268)
svms=tune(svm,factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,scale=FALSE,kernel="linear",ranges=list(cost=c(1e-3, 1e-2,1e-1,1,5,10),gamma=c(0.01,1,5,10)))
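# NB: kernel="linear" is used in this tuning call, so gamma has no effect (note the identical errors across gamma values below); a radial SVM would be tuned with kernel="radial"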
summary(svms)
svms # cost=1 and gamma=0.01
svmfit=svm(factor(diabetes)~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,scale=FALSE,kernel="radial",cost=1,gamma=0.01)
train.class=predict(svmfit)
test.class=predict(svmfit,newdata=test.x)
resmat[i,1]=confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]=confusionMatrix(factor(test.class),factor(test.y))$overall[1]
##
## Call:
## svm(formula = factor(diabetes) ~ npreg + glu + bp + skin + bmi +
## ped + age, data = train.xy, kernel = "linear", cost = 1,
## scale = FALSE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
## gamma: 0.1428571
##
## Number of Support Vectors: 144
##
## ( 71 73 )
##
##
## Number of Classes: 2
##
## Levels:
## 0 1
##
##
##
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.01
##
## - best performance: 0.2166667
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-05 0.3333333 0.07200823
## 2 1e-04 0.3333333 0.07200823
## 3 1e-03 0.3333333 0.07200823
## 4 1e-02 0.2166667 0.04513355
## 5 1e-01 0.2300000 0.05762801
## 6 1e+00 0.2300000 0.04830459
## 7 5e+00 0.2300000 0.04830459
## 8 1e+01 0.2300000 0.04830459
##
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 0.01
##
## - best performance: 0.2033333
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 1e-03 0.01 0.3333333 0.05211573
## 2 1e-02 0.01 0.2066667 0.05621827
## 3 1e-01 0.01 0.2133333 0.05018484
## 4 1e+00 0.01 0.2033333 0.05762801
## 5 5e+00 0.01 0.2033333 0.06929985
## 6 1e+01 0.01 0.2033333 0.06929985
## 7 1e-03 1.00 0.3333333 0.05211573
## 8 1e-02 1.00 0.2066667 0.05621827
## 9 1e-01 1.00 0.2133333 0.05018484
## 10 1e+00 1.00 0.2033333 0.05762801
## 11 5e+00 1.00 0.2033333 0.06929985
## 12 1e+01 1.00 0.2033333 0.06929985
## 13 1e-03 5.00 0.3333333 0.05211573
## 14 1e-02 5.00 0.2066667 0.05621827
## 15 1e-01 5.00 0.2133333 0.05018484
## 16 1e+00 5.00 0.2033333 0.05762801
## 17 5e+00 5.00 0.2033333 0.06929985
## 18 1e+01 5.00 0.2033333 0.06929985
## 19 1e-03 10.00 0.3333333 0.05211573
## 20 1e-02 10.00 0.2066667 0.05621827
## 21 1e-01 10.00 0.2133333 0.05018484
## 22 1e+00 10.00 0.2033333 0.05762801
## 23 5e+00 10.00 0.2033333 0.06929985
## 24 1e+01 10.00 0.2033333 0.06929985
##
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 0.01
##
## - best performance: 0.2033333
nnet only allows a sigmoid activation in the hidden layer, and only one hidden layer. For this binary classification problem we fit with the entropy loss (entropy=TRUE). Setting size=0 together with skip=TRUE gives weights directly from the input layer to the output layer; then our setup is the same as logistic regression. In e1071 the tune function can be used for cross-validation also for nnet. For more details on neural nets see Module page 11.
library(nnet)
# first just check out calls
fit<- nnet(diabetes~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,size=5,maxit=5000,linout=FALSE,entropy=TRUE,reltol=1e-14,abstol=1e-14,trace=FALSE)
summary(fit)
fit2<- nnet(diabetes~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,size=5,maxit=5000,linout=FALSE,entropy=TRUE,reltol=1e-14,abstol=1e-14,trace=FALSE)
summary(fit2)
# seems to be many possible local minima with 46 weights and 300 observations
train.res=predict(fit)
train.class=ifelse(train.res>=0.5,1,0)
test.res=predict(fit,newdata=test.x,type="raw")
test.class=ifelse(test.res>=0.5,1,0)
confusionMatrix(factor(train.class),factor(train.y))$overall[1]
confusionMatrix(factor(test.class),factor(test.y))$overall[1]
roc.nnet = roc(test.y,test.res,legacy.axes=TRUE)
auc(roc.nnet)
ggroc(roc.nnet)+ggtitle("ROC curve")
## a 7-5-1 network with 46 weights
## options were - entropy fitting
## b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1
## -957.49 -24.44 1645.56 1320.70 -2996.66 1448.31 934.96 2750.45
## b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2
## 851.81 -1687.26 -3472.30 -2940.60 -3601.44 478.51 -2205.83 -1345.00
## b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3 i6->h3 i7->h3
## -3880.13 -1624.10 -2239.02 790.07 -320.71 -389.17 -1792.22 -7814.79
## b->h4 i1->h4 i2->h4 i3->h4 i4->h4 i5->h4 i6->h4 i7->h4
## 3233.78 -746.15 4029.66 3026.63 5188.73 2458.61 -638.22 -214.68
## b->h5 i1->h5 i2->h5 i3->h5 i4->h5 i5->h5 i6->h5 i7->h5
## 857.68 1051.00 -2063.77 1027.13 -1736.31 -1302.33 -6291.65 -2091.15
## b->o h1->o h2->o h3->o h4->o h5->o
## -6.06 2.21 4.44 -4.28 6.33 -3.10
## a 7-5-1 network with 46 weights
## options were - entropy fitting
## b->h1 i1->h1 i2->h1 i3->h1 i4->h1 i5->h1 i6->h1 i7->h1
## -99.34 16.13 -215.79 109.26 -57.03 179.53 -189.83 -345.37
## b->h2 i1->h2 i2->h2 i3->h2 i4->h2 i5->h2 i6->h2 i7->h2
## 12.77 224.54 -16.32 69.92 -109.88 -314.51 -86.23 -281.65
## b->h3 i1->h3 i2->h3 i3->h3 i4->h3 i5->h3 i6->h3 i7->h3
## -172.74 52.00 -0.63 -2.10 -127.68 250.39 119.53 16.46
## b->h4 i1->h4 i2->h4 i3->h4 i4->h4 i5->h4 i6->h4 i7->h4
## -37.60 -75.04 -260.51 -140.91 8.30 594.31 -123.51 247.80
## b->h5 i1->h5 i2->h5 i3->h5 i4->h5 i5->h5 i6->h5 i7->h5
## -594.51 -377.70 -164.25 -70.91 -544.19 46.73 357.75 -76.61
## b->o h1->o h2->o h3->o h4->o h5->o
## 5.38 -3.39 -5.59 3.95 -5.21 -6.17
## Accuracy
## 0.9
## Accuracy
## 0.7241379
## Area under the curve: 0.774
# ready to choose a good number of hidden nodes with 5-fold CV
grid=c(1:10,15,20,25,30,50)
k <- 5
set.seed(123)
indices <- sample(1:nrow(train.x))
folds <- cut(indices, breaks = k, labels = FALSE)
table(folds) # 60 in each fold
res=matrix(NA,ncol=k,nrow=length(grid)+1) #adding 0 also
for (j in 1:k)
{
thistrain=(1:dim(train.x)[1])[folds!=j]
thisvalid=(1:dim(train.x)[1])[folds==j]
mean <- apply(ctrain[thistrain,], 2, mean)
std <- apply(ctrain[thistrain,], 2, sd)
new <- scale(ctrain, center = mean, scale = std)
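# note: ctrain still contains the response (diabetes) in column 1, so the response is centered and scaled here as well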
for (i in 1:length(grid))
{
thissize=grid[i]
fit=nnet(diabetes~npreg+glu+bp+skin+bmi+ped+age, data=new, size=thissize,subset=thistrain,linout=FALSE,maxit=5000,entropy=TRUE,
reltol=1e-14,abstol=1e-14)
pred=predict(fit,newdata=new[thisvalid,],type="raw")
valid.class=ifelse(pred>=0.5,1,0)
res[i,j]=sum(abs(train.xy[thisvalid,]$diabetes-valid.class))
}
fit=nnet(diabetes~npreg+glu+bp+skin+bmi+ped+age, data=new, size=0,skip=TRUE,subset=thistrain,linout=FALSE,maxit=10000,entropy=TRUE,
reltol=1e-14,abstol=1e-14)
pred=predict(fit,newdata=new[thisvalid,],type="raw")
valid.class=ifelse(pred>=0.5,1,0)
res[i+1,j]=sum(abs(train.xy[thisvalid,]$diabetes-valid.class))
}
misclass=apply(res,1,sum)/nrow(train.x)
misclass
# 0 hidden nodes won! But be aware that there are multiple minima.
## [1] 0.3333333 0.3333333 0.3333333 0.3033333 0.2900000 0.2666667 0.2733333
## [8] 0.2600000 0.2166667 0.3233333 0.2833333 0.2433333 0.2333333 0.2500000
## [15] 0.2300000 0.2133333
i=8
fit<- nnet(diabetes~npreg+glu+bp+skin+bmi+ped+age,data=train.xy,size=0,skip=TRUE,linout=FALSE,entropy=TRUE,maxit=5000,reltol=1e-14,abstol=1e-14)
summary(fit)
train.res=predict(fit)
train.class=ifelse(train.res>=0.5,1,0)
test.res=predict(fit,newdata=test.x)
test.class=ifelse(test.res>=0.5,1,0)
resmat[i,1]<-confusionMatrix(factor(train.class),factor(train.y))$overall[1]
resmat[i,2]<-confusionMatrix(factor(test.class),factor(test.y))$overall[1]
roc.nnet = roc(test.y,test.res,legacy.axes=TRUE)
resmat[i,3]<-auc(roc.nnet)
## # weights: 8
## initial value 229.710090
## iter 10 value 127.918371
## final value 126.919747
## converged
## a 7-0-1 network with 8 weights
## options were - skip-layer connections entropy fitting
## b->o i1->o i2->o i3->o i4->o i5->o i6->o i7->o
## -1.04 0.34 1.06 -0.18 0.20 0.66 0.61 0.40
Q30: Conclude with choosing a winning method, and explain why you mean that this is the winning method.
print(resmat)
plot(roc.lda,col=1) #solid
plot(roc.logist,add=TRUE,lty="dashed",col=1)
plot(roc.tree,add=TRUE,lty="solid",col=2)
plot(roc.bag,add=TRUE,lty="dashed",col=2)
plot(roc.rf,add=TRUE,lty="dotted",col=2)
plot(roc.nnet,add=TRUE,lty="solid",col=3)
## Accuracy train Accuracy test AUC
## LDA 0.7733333 0.7887931 0.8490155
## logistic regression 0.7966667 0.7974138 0.8450775
## pruned tree 0.8266667 0.3318966 0.5000000
## bagged tree 0.7666667 0.7629310 0.8277755
## random forest 0.7766667 0.7672414 0.8346460
## SVC 0.8100000 0.7758621 NA
## SVM 0.7966667 0.7758621 NA
## NN h0 0.7966667 0.7974138 0.8450775
The logistic regression might be regarded as the winning method, based on interpretability as well as misclassification rate and AUC.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York: Springer.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer.