Compulsory exercise 3, 2018, Problem 1 - Classification with trees
We will use the German credit data set from the UC Irvine machine learning repository. Our aim is to classify a customer as good or bad with respect to credit risk. A set of 20 covariates (attributes), both numerical and categorical, is available for 300 customers with bad credit risk and 700 customers with good credit risk.
More information on the 20 covariates is found in the UCI archive data set description.
library(caret)
# read data, divide into train and test
germancredit = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data")
colnames(germancredit) = c("checkaccount", "duration", "credithistory",
"purpose", "amount", "saving", "presentjob", "installmentrate", "sexstatus",
"otherdebtor", "resident", "property", "age", "otherinstall", "housing",
"ncredits", "job", "npeople", "telephone", "foreign", "response")
germancredit$response = as.factor(germancredit$response)  # 1 = good, 2 = bad
table(germancredit$response)
##
## 1 2
## 700 300
str(germancredit)  # to see which covariates are factors and which are integers
## 'data.frame': 1000 obs. of 21 variables:
## $ checkaccount : Factor w/ 4 levels "A11","A12","A13",..: 1 2 4 1 1 4 4 2 4 2 ...
## $ duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ credithistory : Factor w/ 5 levels "A30","A31","A32",..: 5 3 5 3 4 3 3 3 3 5 ...
## $ purpose : Factor w/ 10 levels "A40","A41","A410",..: 5 5 8 4 1 8 4 2 5 1 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ saving : Factor w/ 5 levels "A61","A62","A63",..: 5 1 1 1 1 5 3 1 4 1 ...
## $ presentjob : Factor w/ 5 levels "A71","A72","A73",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ installmentrate: int 4 2 2 2 3 2 3 2 2 4 ...
## $ sexstatus : Factor w/ 4 levels "A91","A92","A93",..: 3 2 3 3 3 3 3 3 1 4 ...
## $ otherdebtor : Factor w/ 3 levels "A101","A102",..: 1 1 1 3 1 1 1 1 1 1 ...
## $ resident : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 4 levels "A121","A122",..: 1 1 1 2 4 4 2 3 1 3 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ otherinstall : Factor w/ 3 levels "A141","A142",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ housing : Factor w/ 3 levels "A151","A152",..: 2 2 2 3 3 3 2 1 2 2 ...
## $ ncredits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ job : Factor w/ 4 levels "A171","A172",..: 3 3 2 3 3 2 3 4 2 4 ...
## $ npeople : int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : Factor w/ 2 levels "A191","A192": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign : Factor w/ 2 levels "A201","A202": 1 1 1 1 1 1 1 1 1 1 ...
## $ response : Factor w/ 2 levels "1","2": 1 2 1 1 2 1 1 1 1 2 ...
set.seed(4268)  # keep this seed - it makes the work easier to grade
in.train <- createDataPartition(germancredit$response, p = 0.75, list = FALSE)
# 75% for training, one split
germancredit.train <- germancredit[in.train, ]
dim(germancredit.train)
## [1] 750 21
germancredit.test <- germancredit[-in.train, ]
dim(germancredit.test)
## [1] 250 21
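As a quick sanity check (not part of the original code), createDataPartition samples within each class, so the 70/30 class balance should be close to preserved in both sets:
prop.table(table(germancredit.train$response))  # class proportions in the training set
prop.table(table(germancredit.test$response))  # class proportions in the test set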
We will now look at classification trees, bagging, and random forests.
Remark: in the description of the data set it is hinted that we may want to use unequal costs of misclassification for the two classes, but we have not covered unequal misclassification costs in this course, and will therefore not address that in this problem set.
a) Full classification tree [1 point]
# construct full tree
library(tree)
library(pROC)
fulltree = tree(response ~ ., germancredit.train, split = "deviance")
summary(fulltree)
plot(fulltree)
text(fulltree)
print(fulltree)
fullpred = predict(fulltree, germancredit.test, type = "class")
testres = confusionMatrix(data = fullpred, reference = germancredit.test$response)
print(testres)
1 - sum(diag(testres$table))/(sum(testres$table))  # misclassification rate on the test set
predfulltree = predict(fulltree, germancredit.test, type = "vector")
testfullroc = roc(germancredit.test$response == "2", predfulltree[, 2])
auc(testfullroc)
plot(testfullroc)
Run the code and study the output.
- Q1. Explain briefly how fulltree is constructed. The explanation should include the words: greedy, binary, deviance, root, leaves.
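As a starting point for Q1, the deviance used when growing the tree can be computed by hand. A minimal sketch (not part of the exercise code): a node with n_k observations in class k out of n in total has deviance -2*sum_k n_k*log(n_k/n); for the root node this follows directly from the training responses, and can be compared with the root deviance reported by print(fulltree).
# root-node deviance computed by hand: -2 * sum_k n_k * log(n_k / n)
n.k = table(germancredit.train$response)
-2 * sum(n.k * log(n.k/sum(n.k)))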
b) Pruned classification tree [1 point]
# prune the full tree
set.seed(4268)
fullcv = cv.tree(fulltree, FUN = prune.misclass, K = 5)
plot(fullcv$size, fullcv$dev, type = "b", xlab = "Terminal nodes", ylab = "misclassifications")
print(fullcv)
prunesize = fullcv$size[which.min(fullcv$dev)]  # tree size with the fewest CV misclassifications
prunetree = prune.misclass(fulltree, best = prunesize)
plot(prunetree)
text(prunetree, pretty = 1)
predprunetree = predict(prunetree, germancredit.test, type = "class")
prunetest = confusionMatrix(data = predprunetree, reference = germancredit.test$response)
print(prunetest)
1 - sum(diag(prunetest$table))/(sum(prunetest$table))
predprunetree = predict(prunetree, germancredit.test, type = "vector")
testpruneroc = roc(germancredit.test$response == "2", predprunetree[, 2])
auc(testpruneroc)
plot(testpruneroc)
Run the code and study the output.
- Q2. Why do we want to prune the full tree?
- Q3. How is the amount of pruning decided in the code?
- Q4. Compare the full and pruned tree classification methods with focus on interpretability and the ROC curves (AUC); a way to overlay the two curves is sketched after this list.
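For Q4 it may help to see both ROC curves in one plot. A minimal sketch, assuming the roc objects from a) and b) are still in the workspace:
# draw both ROC curves in the same plot, with the AUCs in the legend
plot(testfullroc, col = "black")
lines(testpruneroc, col = "red")
legend("bottomright", legend = c(paste("full tree, AUC =", round(auc(testfullroc), 3)),
    paste("pruned tree, AUC =", round(auc(testpruneroc), 3))), col = c("black", "red"), lty = 1)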
c) Bagged trees [1 point]
library(randomForest)
set.seed(4268)
# mtry = 20 considers all covariates at each split, i.e. bagging
bag = randomForest(response ~ ., data = germancredit, subset = in.train,
mtry = 20, ntree = 500, importance = TRUE)
bag$confusion
1 - sum(diag(bag$confusion))/sum(bag$confusion[1:2, 1:2])  # OOB error rate; the [1:2, 1:2] drops the class.error column
yhat.bag = predict(bag, newdata = germancredit.test)
misclass.bag = confusionMatrix(yhat.bag, germancredit.test$response)
print(misclass.bag)
1 - sum(diag(misclass.bag$table))/(sum(misclass.bag$table))
predbag = predict(bag, germancredit.test, type = "prob")
testbagroc = roc(germancredit.test$response == "2", predbag[, 2])
auc(testbagroc)
plot(testbagroc)
varImpPlot(bag, pch = 20)
Run the code and study the output.
- Q5. What is the main motivation behind bagging? (A by-hand version is sketched after this list.)
- Q6. Explain what the importance plots show, and give your interpretation for the data set.
- Q7. Compare the performance of bagging with the best of the full and pruned tree models above with focus on interpretability and the ROC curves (AUC).
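To make the motivation in Q5 concrete, bagging can be written out by hand with the tree package from a). A minimal illustrative sketch (not the exercise solution; B and the object names are chosen for illustration):
# bagging by hand: average predicted class-2 probabilities over B bootstrap trees
set.seed(4268)
B = 25
n = nrow(germancredit.train)
probs = replicate(B, {
    idx = sample(n, replace = TRUE)  # bootstrap sample of the training rows
    predict(tree(response ~ ., germancredit.train[idx, ]), germancredit.test)[, 2]
})
bag.byhand = rowMeans(probs)  # averaging reduces the variance of a single tree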
d) Random forest [1 point]
set.seed(4268)
rf = randomForest(response ~ ., data = germancredit, subset = in.train,
mtry = 4, ntree = 500, importance = TRUE)
rf$confusion
1 - sum(diag(rf$confusion))/sum(rf$confusion[1:2, 1:2])  # OOB error rate, as for bagging
yhat.rf = predict(rf, newdata = germancredit.test)
misclass.rf = confusionMatrix(yhat.rf, germancredit.test$response)
print(misclass.rf)
1 - sum(diag(misclass.rf$table))/(sum(misclass.rf$table))
predrf = predict(rf, germancredit.test, type = "prob")
testrfroc = roc(germancredit.test$response == "2", predrf[, 2])
auc(testrfroc)
plot(testrfroc)
varImpPlot(rf, pch = 20)
Run the code and study the output.
- Q8. The parameter mtry = 4 is used. What does this parameter mean, and what is the motivation behind choosing exactly this value?
- Q9. The value of the parameter mtry is the only difference between bagging and random forest. What is the effect of choosing mtry to be a value less than the number of covariates? (A small experiment is sketched after this list.)
- Q10. Would you prefer to use bagging or random forest to classify the credit risk data?
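For Q8-Q10 it can be instructive to see how the out-of-bag error changes with mtry. A minimal sketch (not part of the exercise; the grid of values is chosen for illustration), where mtry = 20 corresponds to bagging:
# compare OOB error rates for a few values of mtry
set.seed(4268)
mtry.grid = c(2, 4, 8, 20)
oob.err = sapply(mtry.grid, function(m) {
    fit = randomForest(response ~ ., data = germancredit, subset = in.train,
        mtry = m, ntree = 500)
    fit$err.rate[500, "OOB"]  # OOB error after all 500 trees
})
names(oob.err) = mtry.grid
oob.err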