(Last changes: 13.03 first version)
The maximal score is 15 points. You need a score of 6/15 for the exercise to be approved. Your score will count for 15 percentage points of your final grade.
All supervision is in Smia.
Use `echo=TRUE` and `eval=TRUE` in your code chunks, to make it simpler for us to read and grade. Using `$` to start and stop math mode, or `$$` to start and end equation mode, should be sufficient for what you need.

You need to install the following packages in R to run the code in this file.
install.packages("knitr")           # probably already installed
install.packages("rmarkdown")       # probably already installed
install.packages("bestglm")         # for subset selection with categorical variables
install.packages("glmnet")          # for lasso
install.packages("tree")            # for trees
install.packages("randomForest")    # for random forest
install.packages("ElemStatLearn")   # dataset in Problem 2
BiocManager::install(c("pheatmap")) # heatmap in Problem 2
The following are packages that might be useful.
install.packages("ggplot2")
install.packages("GGally") # for ggpairs
install.packages("caret") #for confusion matrices
install.packages("pROC") #for ROC curves
install.packages("e1071") # for support vector machines
install.packages("nnet") # for feed forward neural networks
In this problem we will work with understanding and applying different regression methods, using the `diamonds` data set from the `ggplot2` package. The `diamonds` data set consists of `price` (in US dollars), which will be our response, and quality information (9 covariates) for around 54000 diamonds. According to Wickham (2016), Section 3.10, there are four C's of diamond quality: `carat` (weight), `cut` (quality of the cut: Fair/Good/Very Good/Premium/Ideal), `color` (from worst J to best D) and `clarity` (from worst to best: I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF). In addition there are five physical measurements: `depth` (total depth percentage, calculated from x, y and z), `table` (width of top of diamond relative to widest point), `xx` (length in mm), `yy` (width in mm) and `zz` (depth in mm), explained with a drawing in Figure 3.1 of Wickham (2016) (ebook available for free for NTNU students). There are no missing data.

In addition we have made two transformed variables: `logprice` (logarithm, base 10, of the price) and `logcarat` (logarithm, base 10, of the carat).

To be able to do the analyses without too much hassle (running time for analyses and plotting), we constructed a smaller version of the data set and divided it into a training set `dtrain` and a test set `dtest`, to be loaded from the course web-page as follows.
all=dget("https://www.math.ntnu.no/emner/TMA4268/2019v/data/diamond.dd")
dtrain=all$dtrain
dtest=all$dtest
Our aim will be to predict the price of the diamond, based on some or all of the covariates. We would also like to understand which of the covariates are important in predicting the price and how the covariates are related to the price.
Q1: Would you choose `price` or `logprice` as the response variable? Justify your choice. Next, plot your choice of response pairwise with `carat`, `logcarat`, `color`, `clarity` and `cut`. Comment.
We now use `logprice` as the response and `carat` as the covariate. Use the local regression model \(\texttt{logprice} = \beta_0 + \beta_1 \texttt{carat} + \beta_2 \texttt{carat}^2\), weighted by the tricube kernel \(K_{i0}\).
Q2: What is the predicted price of a diamond weighing 1 carat? Use the closest 20% of the observations.
Q3: What choice of \(\beta_1\), \(\beta_2\) and \(K_{i0}\) would result in KNN-regression?
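The computation in Q2 can be sketched in base R on simulated data; the variables `x`, `y` and the target point `x0` below are illustrative stand-ins for `carat`, `logprice` and the 1-carat query point, not the course data.

```r
set.seed(1)
n <- 500
x <- runif(n, 0.2, 2.5)                   # stand-in for carat
y <- 3 + log10(x) + rnorm(n, sd = 0.1)    # stand-in for logprice
x0 <- 1                                   # predict at 1 carat
d <- abs(x - x0)
h <- sort(d)[ceiling(0.2 * n)]            # bandwidth: distance to the closest 20%
w <- ifelse(d <= h, (1 - (d / h)^3)^3, 0) # tricube weights K_i0
fit <- lm(y ~ x + I(x^2), weights = w)    # weighted quadratic fit
predict(fit, newdata = data.frame(x = x0))
```

Observations outside the neighbourhood get weight zero, so only the closest 20% of the observations influence the fit.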
Q4: Describe how you can perform model selection in regression with AIC as criterion.
Q5: What are the main differences between using AIC for model selection and using cross-validation (with mean squared test error MSE)?
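As a reminder of what the criterion computes (simulated data, illustrative names): for a model fitted by maximum likelihood, AIC = -2 log-likelihood + 2 times the number of estimated parameters (including the error variance for `lm`), and smaller is better.

```r
set.seed(42)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + rnorm(n)               # x2 is irrelevant by construction
m1 <- lm(y ~ x1)
m2 <- lm(y ~ x1 + x2)
ll <- logLik(m1)
aic_manual <- -2 * as.numeric(ll) + 2 * attr(ll, "df")  # df counts betas and sigma
c(manual = aic_manual, builtin = AIC(m1))               # the two agree
AIC(m1); AIC(m2)                                        # smaller AIC is preferred
```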
Q6: See the code below for performing model selection with `bestglm()` based on AIC. What kind of contrast is used to represent `cut`, `color` and `clarity`? Write down the final best model and explain what you can interpret from it.
Q7: Calculate and report the MSE of the test set using the best model (on the scale of `logprice`).
library(bestglm)
ds = as.data.frame(within(dtrain, {
  y = logprice    # setting the response
  logprice = NULL # do not include as covariate
  price = NULL    # do not include as covariate
  carat = NULL    # do not include as covariate
}))
fit = bestglm(Xy = ds, IC = "AIC")$BestModel
summary(fit)
Q8: Build a model matrix for the covariates `~ logcarat + cut + clarity + color + depth + table + xx + yy + zz - 1`. What is the dimension of this matrix?
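A toy illustration of the `-1` convention in base R (the data frame and names below are made up): without an intercept, the first factor is expanded to one dummy column per level, while later factors keep treatment contrasts.

```r
toy <- data.frame(
  xnum = rnorm(6),
  f1 = factor(rep(c("a", "b", "c"), 2)),
  f2 = factor(rep(c("u", "v"), 3))
)
mm <- model.matrix(~ xnum + f1 + f2 - 1, data = toy)
dim(mm)       # 6 rows and 1 + 3 + 1 = 5 columns
colnames(mm)
```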
Q9: Fit a lasso regression to the diamond data with `logprice` as the response and the model matrix from Q8. How did you find the value to use for the regularization parameter?
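One standard way to choose the regularization parameter is cross-validation, as in this hedged `glmnet` sketch on simulated data (all names here are illustrative, not the diamond data):

```r
library(glmnet)
set.seed(1)
n <- 100; p <- 5
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1] - 0.5 * x[, 2] + rnorm(n)
cvfit <- cv.glmnet(x, y, alpha = 1)  # alpha = 1 gives the lasso penalty
cvfit$lambda.min                     # lambda with smallest CV error
coef(cvfit, s = "lambda.min")        # coefficients at that lambda
```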
Q10: Calculate and report the MSE of the test set (on the scale of `logprice`).
Q11: A regression tree is built using a greedy approach. What does that mean? Explain the strategy used for constructing a regression tree.
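One greedy step can be sketched in base R on simulated data (names illustrative): scan every candidate split point and keep the one that minimizes the summed RSS of the two resulting regions, without looking ahead to later splits.

```r
set.seed(2)
x <- runif(100)
y <- ifelse(x < 0.5, 1, 3) + rnorm(100, sd = 0.2)  # true change point at 0.5
rss <- function(v) sum((v - mean(v))^2)
splits <- sort(unique(x))[-1]                      # candidate split points
total <- sapply(splits, function(s) rss(y[x < s]) + rss(y[x >= s]))
best <- splits[which.min(total)]
best  # close to the true change point 0.5
```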
Q12: Is a regression tree a suitable method to handle both numerical and categorical covariates? Elaborate.
Q13: Fit a (full) regression tree to the diamond data with `logprice` as the response (and the same covariates as for c and d), and plot the result. Comment briefly on your findings.
Q14: Calculate and report the MSE of the test set (on the scale of `logprice`).
Q15: Explain the motivation behind bagging, and how bagging differs from a random forest. What is the role of bootstrapping?
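Bagging itself can be sketched in a few lines of base R (simulated data; any base learner would do, and `lm` with a cubic is used here purely for illustration): fit the learner on B bootstrap samples and average the predictions.

```r
set.seed(3)
n <- 100
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
B <- 50
x0 <- data.frame(x = 0.5)
preds <- replicate(B, {
  idx <- sample(n, replace = TRUE)   # bootstrap sample of the training data
  fit <- lm(y ~ poly(x, 3), data = data.frame(x = x[idx], y = y[idx]))
  predict(fit, newdata = x0)
})
mean(preds)  # bagged prediction at x = 0.5: the average over the B fits
```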
Q16: What are the parameter(s) to be set in random forest, and what are the rules for setting these?
Q17: Boosting is a popular method. What is the main difference between random forest and boosting?
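The sequential nature of boosting can be sketched in base R with stumps on simulated data (all names illustrative): each stump is fit to the residuals of the current ensemble and added with a small learning rate, whereas a random forest fits its trees independently of each other.

```r
set.seed(7)
n <- 200
x <- runif(n); y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
stump <- function(x, r) {  # best single split for residuals r
  splits <- quantile(x, seq(0.1, 0.9, 0.1))
  sse <- sapply(splits, function(s)
    sum((r - ifelse(x < s, mean(r[x < s]), mean(r[x >= s])))^2))
  s <- splits[which.min(sse)]
  list(s = s, left = mean(r[x < s]), right = mean(r[x >= s]))
}
pred_stump <- function(m, x) ifelse(x < m$s, m$left, m$right)
fhat <- rep(0, n); nu <- 0.1           # nu is the learning rate
for (b in 1:200) {
  m <- stump(x, y - fhat)              # fit to the current residuals
  fhat <- fhat + nu * pred_stump(m, x) # add a shrunken copy to the ensemble
}
mean((y - fhat)^2)  # training MSE decreases as more stumps are added
```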
Q18: Fit a random forest to the diamond data with `logprice` as the response (and the same covariates as before). Comment on your choice of parameter(s) (as described in Q16).
Q19: Make a variable importance plot and comment on it. Calculate and report the MSE of the test set (on the scale of `logprice`).
Q20: Finally, compare the results from c (subset selection), d (lasso), e (tree) and f (random forest): Which method has given the best performance on the test set and which method has given you the best insight into the relationship between the price and the covariates?
We will use a data set called `nci` from the Friedman, Hastie, and Tibshirani (2001) book. The data set consists of measurements of the activity of \(p=6830\) genes for \(n=64\) samples from cancer tumors. Our aim here is to use unsupervised methods to study the relationship among the cancer samples. The data are on a transformed scale, where a negative number means that the gene has a lower regulation than some reference sample and a positive number a higher regulation. The following names are given for the cancer tumors (R print-out). It is known that K562 cells are leukemia cells and MCF7 are breast cancer cells. Observe the one sample labelled as `UNKNOWN`. More information: http://genome-www.stanford.edu/nci60/
Let \({\bf X}\) be a \(n\times p\) matrix of these observations, and \({\bf Z}\) a standardized version, where column (gene) means are 0 and column standard deviations are 1. Further, let \({\bf R}\) be the \(p \times p\) covariance matrix of the gene-standardized observations, which can be estimated as
\[ {\hat {\bf R}}=\frac{1}{n-1}\sum_{i=1}^n ({\bf Z}_i-\bar{\bf Z})({\bf Z}_i-\bar{\bf Z})^T\] where \(\bar{\bf Z}=\frac{1}{n}\sum_{i=1}^n {\bf Z}_i\).
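The estimate above can be checked numerically in base R (small simulated matrix, names illustrative): with column-standardized `Z` the sum of outer products equals `crossprod(Z)/(n-1)`, which is exactly the sample correlation matrix of `X`.

```r
set.seed(4)
n <- 20; p <- 3
X <- matrix(rnorm(n * p), n, p)
Z <- scale(X)                   # column means 0, column sds 1
Rhat <- crossprod(Z) / (n - 1)  # (1/(n-1)) * sum_i Z_i Z_i^T
all.equal(Rhat, cor(X), check.attributes = FALSE)
```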
Studying the relationship between the different tumor samples is hard due to the high dimension of the gene data, and we know that the gene expressions of different genes may be correlated. One way to proceed is to calculate principal components, which we find using the eigenvectors of the covariance matrix \({\hat {\bf R}}\).
Q21: What is the definition of a principal component score, and how is the score related to the eigenvectors of the matrix \({\hat {\bf R}}\)?
Study the R-code below. The eigenvectors of \({\hat {\bf R}}\) are found in `pca$rotation` and the principal component scores in `pca$x`.
Q22: Explain what is given in the plot with title “First eigenvector”. Why are there only \(n=64\) eigenvectors and \(n=64\) principal component scores?
Q23: How many principal components are needed to explain 80% of the total variance in \({\bf Z}\)? Why is `sum(pca$sdev^2)` equal to \(p\)?
Hint: Total variance is given as the trace of \({\hat {\bf R}}\), the trace of a square matrix is the sum of the diagonal elements.
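A small base-R check of these facts on simulated data (here n > p, so all p components exist; for the `nci` data p > n, which is why only n = 64 components appear): the scores are Z times the eigenvectors, and the eigenvalues sum to the trace of \({\hat {\bf R}}\), which is p for standardized data.

```r
set.seed(5)
n <- 30; p <- 4
Z <- scale(matrix(rnorm(n * p), n, p))
Rhat <- crossprod(Z) / (n - 1)
e <- eigen(Rhat)
scores <- Z %*% e$vectors         # principal component scores
cumsum(e$values) / sum(e$values)  # cumulative proportion of variance explained
sum(e$values)                     # equals trace(Rhat) = p
```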
Q24: Study the PC1 vs PC2 plot, and comment on the groupings observed. What can you say about the placement of the `K562`, `MCF7` and `UNKNOWN` samples? Produce the same plot for two other pairs of PCs and comment on your observations.
library(ElemStatLearn)
X=t(nci) #n times p matrix
ngroups=length(table(rownames(X)))
cols=rainbow(ngroups)
cols[c(4,5,7,8,14)] = "black"
pch.vec = rep(4,14)
pch.vec[c(4,5,7,8,14)] = 15:19
colsvsnames=cbind(cols,sort(unique(rownames(X))))
colsamples=cols[match(rownames(X),colsvsnames[,2])]
pchvsnames=cbind(pch.vec,sort(unique(rownames(X))))
pchsamples=pch.vec[match(rownames(X),pchvsnames[,2])]
Z=scale(X)
pca=prcomp(Z)
plot(pca$x[,1],pca$x[,2],xlab="PC1",ylab="PC2",pch=pchsamples,col=colsamples)
legend("bottomright",legend = colsvsnames[,2],cex=0.55,col=cols,pch = pch.vec)
plot(1:dim(X)[2],pca$rotation[,1],type="l",xlab="genes",ylab="weight",main="First eigenvector")
Hierarchical clustering can be used to group the cancer samples, and then we may see which samples are most similar to the five samples labelled `K562`, `MCF7` and `UNKNOWN`.
Q25: Explain what it means to use Euclidean distance and average linkage for hierarchical clustering.
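A toy base-R illustration (simulated points): with average linkage, the height at which two clusters merge is the average of all pairwise Euclidean distances between their members, and the first merge joins the overall closest pair.

```r
set.seed(6)
toy <- matrix(rnorm(10 * 2), 10, 2)   # 10 points in 2 dimensions
d <- dist(toy, method = "euclidean")  # pairwise Euclidean distances
hc <- hclust(d, method = "average")   # average linkage
plot(hc)  # dendrogram; merge heights are average inter-cluster distances
```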
Q26: Perform hierarchical clustering with Euclidean distance and average linkage on the scaled gene expression data in `Z`. Observe where the samples labelled `K562`, `MCF7` and `UNKNOWN` are placed in the dendrogram. Which conclusions can you draw from this?
Q27: Study the R-code and plot below. Here we have performed hierarchical clustering based on the first 64 principal components instead of the gene expression data in `Z`. What is the difference between using all the gene expression data and using the first 64 principal components in the clustering? We have plotted the dendrogram together with a heatmap of the data. Explain what is shown in the heatmap: what is given on the horizontal axis, the vertical axis, and in the pixel grid?
library(pheatmap)
npcs=64
pheatmap(pca$x[, 1:npcs], scale = "none",
         cluster_col = FALSE, cluster_row = TRUE,
         clustering_distance_rows = "euclidean",
         clustering_method = "average",
         fontsize_row = 5, fontsize_col = 5)
A learning objective in this course is: The student knows, based on an existing data set, how to choose a suitable statistical model, apply sound statistical methods, and perform the analyses using statistical software. The student knows how to present the results from the statistical analyses, and which conclusions can be drawn from the analyses.
This problem addresses that objective.
We will use the classical data set of diabetes from a population of women of Pima Indian heritage in the US, available in the R `MASS` package. Several clinical measurements are available for each woman, together with the diabetes status (`0` = not present, `1` = present).

We will use a training set (called `ctrain`) with 300 observations (200 non-diabetes and 100 diabetes cases) and a test set (called `ctest`) with 232 observations (155 non-diabetes and 77 diabetes cases). Our aim is to make a classification rule for diabetes (or not) based on the available data.
flying=dget("https://www.math.ntnu.no/emner/TMA4268/2019v/data/flying.dd")
ctrain=flying$ctrain
ctest=flying$ctest
Q28: Start by getting to know the training data, by producing summaries and plots. Write a few sentences about what you observe and include your top 3 informative plots and/or outputs.
Q29: Use different methods to analyse the data. In particular use
For each method you
Q30: Conclude by choosing a winning method, and explain why you consider it the winning method.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2001. The Elements of Statistical Learning. Vol. 1. Springer Series in Statistics. New York: Springer.
Wickham, Hadley. 2016. ggplot2: Elegant Graphics for Data Analysis. Springer.