(12.04: added extra supervision times; 16.04: added random seed in the SVM task.)
The maximal score is 10 points. You need a score of at least 4/10 for the exercise to be approved. Your score will make up 10 percentage points of your final grade.
Please use `echo=TRUE` and `eval=TRUE` in your R chunks to make it simpler for us to read and grade. A template file for the submission of Compulsory exercise 3 is here: https://www.math.ntnu.no/emner/TMA4268/2018v/CompEx3mal.Rmd
You need to install the following packages in R:
install.packages("caret")
install.packages("tree")
install.packages("pROC")
install.packages("RandomForest")
install.packages("corrplot")
install.packages("knitr")
install.packages("e1071")
install.packages("keras")
The `keras` package will call Python, so Python must be installed on your machine. After installing the `keras` R package, the installation of the Keras backend and its dependencies needs to be done only once:
library(keras) # Load R package Keras
install_keras() # Install Keras and dependencies
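The following check is optional (a sketch, not part of the exercise): once the Python backend is correctly installed, it should return `TRUE`.
library(keras)
is_keras_available() # TRUE if R finds the Python Keras installation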
We will use the German credit data set from the UC Irvine machine learning repository. Our aim is to classify a customer as good or bad with respect to credit risk. A set of 20 covariates (attributes) are available (both numerical and categorical) for 300 customers with bad credit risk and 700 customers with good credit risk.
More information on the 20 covariates is found at the UCI archive data set description.
library(caret)
#read data, divide into train and test
germancredit = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
colnames(germancredit) = c("checkaccount", "duration", "credithistory", "purpose", "amount", "saving", "presentjob", "installmentrate", "sexstatus", "otherdebtor", "resident", "property", "age", "otherinstall", "housing", "ncredits", "job", "npeople", "telephone", "foreign", "response")
germancredit$response = as.factor(germancredit$response) #2=bad
table(germancredit$response)
str(germancredit) # to see factors and integers, numerics
set.seed(4268) # keep this seed - it makes the work easier to grade
in.train <- createDataPartition(germancredit$response, p=0.75, list=FALSE)
# 75% for training, one split
germancredit.train <- germancredit[in.train,]; dim(germancredit.train)
germancredit.test <- germancredit[-in.train,];dim(germancredit.test)
##
## 1 2
## 700 300
## 'data.frame': 1000 obs. of 21 variables:
## $ checkaccount : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ credithistory : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ purpose : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ saving : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ presentjob : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ installmentrate: int 4 2 2 2 3 2 3 2 2 4 ...
## $ sexstatus : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ otherdebtor : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ resident : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ otherinstall : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ ncredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ npeople : int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ response : Factor w/ 2 levels "1","2": 1 2 1 1 2 1 1 1 1 2 ...
## [1] 750 21
## [1] 250 21
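Since `createDataPartition` samples within each level of the response, the 70/30 ratio of good to bad credit risk is (approximately) preserved in both sets; a quick check (optional, not part of the exercise):
# class proportions in training and test set - both should be close to 0.7/0.3
prop.table(table(germancredit.train$response))
prop.table(table(germancredit.test$response))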
We will now look at classification trees, bagging, and random forests.
Remark: in the description of the data set it is hinted that one may use unequal costs of misclassification for the two classes, but we have not covered unequal misclassification costs in this course, and will therefore not address that in this problem set.
# construct full tree
library(tree)
library(pROC)
fulltree=tree(response~.,germancredit.train,split="deviance")
summary(fulltree)
plot(fulltree)
text(fulltree)
print(fulltree)
fullpred=predict(fulltree,germancredit.test,type="class")
testres=confusionMatrix(data=fullpred,reference=germancredit.test$response)
print(testres)
1-sum(diag(testres$table))/(sum(testres$table))
predfulltree = predict(fulltree,germancredit.test, type = "vector")
testfullroc=roc(germancredit.test$response == "2", predfulltree[,2])
auc(testfullroc)
plot(testfullroc)
Run the code and study the output. Explain briefly how `fulltree` is constructed. The explanation should include the words: greedy, binary, deviance, root, leaves.

# prune the full tree
set.seed(4268)
fullcv=cv.tree(fulltree,FUN=prune.misclass,K=5)
plot(fullcv$size,fullcv$dev,type="b", xlab="Terminal nodes",ylab="misclassifications")
print(fullcv)
prunesize=fullcv$size[which.min(fullcv$dev)]
prunetree=prune.misclass(fulltree,best=prunesize)
plot(prunetree)
text(prunetree,pretty=1)
predprunetree = predict(prunetree,germancredit.test, type = "class")
prunetest=confusionMatrix(data=predprunetree,reference=germancredit.test$response)
print(prunetest)
1-sum(diag(prunetest$table))/(sum(prunetest$table))
predprunetree = predict(prunetree,germancredit.test, type = "vector")
testpruneroc=roc(germancredit.test$response == "2", predprunetree[,2])
auc(testpruneroc)
plot(testpruneroc)
Run the code and study the output.
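To compare the full and the pruned tree visually, the two ROC curves can be overlaid (an optional sketch using the `roc` objects created above):
# overlay ROC curves for the full and the pruned tree
plot(testfullroc, col="red")
plot(testpruneroc, col="blue", add=TRUE)
legend("bottomright", legend=c("full tree","pruned tree"), col=c("red","blue"), lty=1)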
library(randomForest)
set.seed(4268)
bag=randomForest(response~., data=germancredit,subset=in.train,
mtry=20,ntree=500,importance=TRUE)
bag$confusion
1-sum(diag(bag$confusion))/sum(bag$confusion[1:2,1:2])
yhat.bag=predict(bag,newdata=germancredit.test)
misclass.bag=confusionMatrix(yhat.bag,germancredit.test$response)
print(misclass.bag)
1-sum(diag(misclass.bag$table))/(sum(misclass.bag$table))
predbag = predict(bag,germancredit.test, type = "prob")
testbagroc=roc(germancredit.test$response == "2", predbag[,2])
auc(testbagroc)
plot(testbagroc)
varImpPlot(bag,pch=20)
Run the code and study the output.
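If you want to see how the out-of-bag error develops with the number of trees, the fitted object can be plotted directly (an optional sketch, assuming the `bag` object from above):
# OOB error (black) and class-wise errors (red, green) vs. number of trees
plot(bag, main="Bagging: error vs. number of trees")
legend("topright", legend=colnames(bag$err.rate), col=1:3, lty=1:3)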
set.seed(4268)
rf=randomForest(response~.,
data=germancredit,subset=in.train,
mtry=4,ntree=500,importance=TRUE)
rf$confusion
1-sum(diag(rf$confusion))/sum(rf$confusion[1:2,1:2])
yhat.rf=predict(rf,newdata=germancredit.test)
misclass.rf=confusionMatrix(yhat.rf,germancredit.test$response)
print(misclass.rf)
1-sum(diag(misclass.rf$table))/(sum(misclass.rf$table))
predrf = predict(rf,germancredit.test, type = "prob")
testrfroc=roc(germancredit.test$response == "2", predrf[,2])
auc(testrfroc)
plot(testrfroc)
varImpPlot(rf,pch=20)
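As a hint for the questions below: 4 is exactly the `randomForest` default for `mtry` in classification, the square root of the number of covariates rounded down. A one-line check (a sketch, not part of the exercise):
p = ncol(germancredit) - 1 # 20 covariates, excluding the response
floor(sqrt(p)) # = 4, the default mtry for classification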
Run the code and study the output. Here `mtry=4` is used. What does this parameter mean, and what is the motivation behind choosing exactly this value? The choice of `mtry` is the only difference between bagging and a random forest. What is the effect of choosing `mtry` to be a value less than the number of covariates?

We will study classification applied to a simulated data set with two classes from @ESL, where the data set is supplied together with the Bayes decision boundary. The boundary is plotted below, together with a training set.
load(url("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/ESL.mixture.rda"))
#names(ESL.mixture)
#prob gives probabilities for each class when the true density functions are known
#px1 and px2 are coordinates in x1 (length 69) and x2 (length 99) where the class probabilities are calculated
rm(x,y)
attach(ESL.mixture)
dat=data.frame(y=factor(y),x)
xgrid=expand.grid(X1=px1,X2=px2)
par(pty="s")
plot(xgrid, pch=20,cex=.2)
points(x,col=y+1,pch=20)
contour(px1,px2,matrix(prob,69,99),level=0.5,add=TRUE,col="blue",lwd=2) #optimal boundary
library(e1071)
# support vector classifier
svcfits=tune(svm,factor(y)~.,data=dat,scale=FALSE,kernel="linear",ranges=list(cost=c(1e-2,1e-1,1,5,10)))
summary(svcfits)
svcfit=svm(factor(y)~.,data=dat,scale=FALSE,kernel="linear",cost=0.01)
# support vector machine with radial kernel
set.seed(4268)
svmfits=tune(svm,factor(y)~.,data=dat,scale=FALSE,kernel="radial",ranges=list(cost=c(1e-2,1e-1,1,5,10),gamma=c(0.01,1,5,10)))
summary(svmfits)
svmfit=svm(factor(y)~.,data=dat,scale=FALSE,kernel="radial",cost=1,gamma=5)
# the same as in a - the Bayes boundary
par(pty="s")
plot(xgrid, pch=20,cex=.2)
points(x,col=y+1,pch=20)
contour(px1,px2,matrix(prob,69,99),level=0.5,add=TRUE,col="blue",lwd=2) #optimal boundary
# decision boundaries from svc and svm added
svcfunc=predict(svcfit,xgrid,decision.values=TRUE)
svcfunc=attributes(svcfunc)$decision
contour(px1,px2,matrix(svcfunc,69,99),level=0,add=TRUE,col="red") #svc boundary
svmfunc=predict(svmfit,xgrid,decision.values=TRUE)
svmfunc=attributes(svmfunc)$decision
contour(px1,px2,matrix(svmfunc,69,99),level=0,add=TRUE,col="orange") #svm boundary
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 0.1
##
## - best performance: 0.285
##
## - Detailed performance results:
## cost error dispersion
## 1 0.01 0.300 0.08819171
## 2 0.10 0.285 0.08181958
## 3 1.00 0.290 0.11254629
## 4 5.00 0.290 0.11254629
## 5 10.00 0.290 0.11254629
##
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost gamma
## 1 5
##
## - best performance: 0.155
##
## - Detailed performance results:
## cost gamma error dispersion
## 1 0.01 0.01 0.455 0.17551511
## 2 0.10 0.01 0.440 0.16964014
## 3 1.00 0.01 0.295 0.09264628
## 4 5.00 0.01 0.290 0.10219806
## 5 10.00 0.01 0.295 0.09264628
## 6 0.01 1.00 0.435 0.19155794
## 7 0.10 1.00 0.210 0.07745967
## 8 1.00 1.00 0.170 0.10327956
## 9 5.00 1.00 0.165 0.08181958
## 10 10.00 1.00 0.180 0.07527727
## 11 0.01 5.00 0.395 0.22785229
## 12 0.10 5.00 0.315 0.20145305
## 13 1.00 5.00 0.155 0.06851602
## 14 5.00 5.00 0.190 0.06992059
## 15 10.00 5.00 0.200 0.09128709
## 16 0.01 10.00 0.430 0.19032137
## 17 0.10 10.00 0.415 0.18265024
## 18 1.00 10.00 0.175 0.06770032
## 19 5.00 10.00 0.245 0.10394977
## 20 10.00 10.00 0.260 0.11005049
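Beyond the plots, one simple way to compare the two fits is the number of support vectors each of them uses (an optional sketch, based on the `svcfit` and `svmfit` objects above):
# total number of support vectors for the linear and the radial fit
svcfit$tot.nSV
svmfit$tot.nSV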
We will study data on consumption of different beverages for a selection of countries, and perform principal component analysis.
# reading data on consumption of different beverages for countries
drink <- read.csv("https://www.math.ntnu.no/emner/TMA4267/2017v/drikke.TXT",sep=",",header=TRUE)
drink <- na.omit(drink)
# looking at correlation between consumptions
drinkcorr=cor(drink)
library(corrplot)
corrplot(drinkcorr,method="circle")
# now for PCA
pcaS <- prcomp(drink,scale=TRUE) # scale: variables are scaled
pcaS$rotation
summary(pcaS)
biplot(pcaS,scale=0,cex=0.6) # scale=0: arrows scaled to represent the loadings
## PC1 PC2 PC3 PC4 PC5
## Coffee -0.26029733 0.66788815 -0.22475187 0.4132467433 0.07431918
## Tea 0.65540048 -0.09539757 0.36756357 -0.0002927055 -0.12503940
## Cocoa 0.23510209 0.57754726 -0.06603093 -0.4200858712 -0.61199325
## Liquer 0.02190508 -0.32118904 -0.79997824 -0.3292322714 -0.12307455
## Wine -0.50599685 0.06551597 0.37109534 -0.6765579799 0.15862233
## Beer 0.43693234 0.32219426 -0.17985159 -0.2943302832 0.75099779
## PC6
## Coffee 0.5092751
## Tea 0.6407898
## Cocoa -0.2362164
## Liquer 0.3644878
## Wine 0.3450672
## Beer -0.1493533
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.3117 1.1957 1.0681 0.8793 0.8576 0.44782
## Proportion of Variance 0.2867 0.2383 0.1901 0.1288 0.1226 0.03342
## Cumulative Proportion 0.2867 0.5250 0.7151 0.8440 0.9666 1.00000
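The variance proportions from the summary can also be shown as a scree plot (an optional sketch based on the `pcaS` object above):
# scree plot: proportion of variance explained by each principal component
pve = pcaS$sdev^2/sum(pcaS$sdev^2)
plot(pve, type="b", xlab="Principal component", ylab="Proportion of variance explained", ylim=c(0,1))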
Explain what you see in the `biplot` in relation to the loadings for the first two principal components.

Within the research area of phylogeny, similarities between species are studied, and below you find a distance matrix for five species. The distances are calculated from mitochondrial DNA sequences from the species.
library(knitr)
species=c("Human","Chimpanzee","Gorilla","Orangutan","Gibbon")
distJC <- matrix(c(0,1,3,9,12,
1,0,2,8,11,
3,2,0,6,11,
9,8,6,0,11,
12,11,11,11,0),5,5)
dimnames(distJC) <- list(species,species)
kable(distJC)
| | Human | Chimpanzee | Gorilla | Orangutan | Gibbon |
|---|---|---|---|---|---|
| Human | 0 | 1 | 3 | 9 | 12 |
| Chimpanzee | 1 | 0 | 2 | 8 | 11 |
| Gorilla | 3 | 2 | 0 | 6 | 11 |
| Orangutan | 9 | 8 | 6 | 0 | 11 |
| Gibbon | 12 | 11 | 11 | 11 | 0 |
Above you see three dendrograms labelled A, B and C, constructed from the distance matrix above. The dendrograms agree on how to cluster the species, but differ in the branch lengths at which species are fused together into clusters.
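The dendrograms themselves are not reproduced here, but trees of this kind can be generated from the distance matrix with `hclust` using different linkage methods. The following sketch produces three comparable trees (the choice of single, average and complete linkage is an assumption, not necessarily what produced A, B and C):
# hierarchical clustering of the five species with three linkage methods
d = as.dist(distJC)
par(mfrow=c(1,3))
plot(hclust(d, method="single"), main="Single linkage")
plot(hclust(d, method="average"), main="Average linkage")
plot(hclust(d, method="complete"), main="Complete linkage")
par(mfrow=c(1,1))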
During our lectures, we have used a densely connected neural network to classify a movie review as either positive or negative using the IMDB data set. We will use the same data here, refer to the lecture slides for reading the data set and setting up the network.
More specifically, we have defined a linear stack of dense layers, with two intermediate layers of 16 hidden units each, using ReLU activation functions.
model <- keras_model_sequential() %>%
layer_dense(units = 16, activation = "relu", input_shape = c(10000)) %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dense(units = 1, activation = "sigmoid")
What is the advantage of using the `relu` activation function? Why do we need to use a different activation function (`sigmoid`) in the output layer instead of using `relu` again?

In one of our lectures, we fitted the model above and plotted the training and validation loss as well as the accuracy, as illustrated in the figure below.
Keep two intermediate layers in the network. Try creating a simpler model with 4 hidden units instead of 16 and a more complex model with 32 hidden units instead of 16.
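As a starting point, the simpler model only changes the `units` argument of the two intermediate layers (a sketch; the 32-unit model is analogous, with 4 replaced by 32):
# sketch: the simpler network with 4 hidden units in each intermediate layer
model_small <- keras_model_sequential() %>%
  layer_dense(units = 4, activation = "relu", input_shape = c(10000)) %>%
  layer_dense(units = 4, activation = "relu") %>%
  layer_dense(units = 1, activation = "sigmoid")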