MA8701 Advanced methods in statistical inference and learning
Part 3: Ensembles. L15: Stacked ensembles
Course homepage: https://wiki.math.ntnu.no/ma8701/2023v/start
Before we start
Literature
Supporting literature
Ensembles - overview
(ELS Ch 16.1)
With ensembles we want to build one prediction model which combines the strength of a collection of models.
These models may be simple base models - or more elaborate models.
We have studied bagging - where we use the bootstrap to repeatedly fit a statistical model and then take a simple average of the predictions (or a majority vote). Here the base models can be trees - or other types of models.
Random forest is a version of bagging with trees, with trees made to be different (decorrelated).
We have studied boosting, where the models are trained on sequentially different data - from residuals or gradients of loss functions - and the ensemble members cast weighted votes (downweighted by a learning rate). We have observed that there are many hyperparameters that need to be tuned to optimize performance.
Stacked ensembles
aka super learner or generalized stacking
What is it?
Stacked ensembles is an algorithm that combines
- multiple, (typically) diverse prediction methods (learning algorithms), called (first-level) base learners, with
- a second-level metalearner - so that the ensemble can be seen as a single method.
Development:
- 1992: stacking introduced for neural nets by Wolpert
- 1996: adapted to regression problems by Breiman - but only for one type of method at a time (CART with different numbers of terminal nodes, GLMs with subset selection, ridge regression with different ridge penalty parameters) (Breiman 1996)
- 2007: proven to have an asymptotic theoretical oracle property by Laan, Polley, and Hubbard (2007)
- 2015: extensions in the PhD thesis of Erin LeDell (LeDell 2015)
Ingredients:
Training data (level-zero data) \(O_i=(X_i,Y_i)\) of \(N\) i.i.d. observations.
A total of \(L\) base learning algorithms \(\Psi^l\) for \(l=1,\ldots,L\), each from some algorithmic class and each with a specific set of model parameters.
A metalearner \({\boldsymbol \Phi}\) is used to find an optimal combination of the \(L\) base learners.
Algorithm
Step 1: Produce level-one data \({\boldsymbol Z}\)
Divide the training data \({\boldsymbol X}\) randomly into \(V\) roughly equally sized validation folds \({\boldsymbol X}_{(1)},\ldots,{\boldsymbol X}_{(V)}\); \(V\) is often 5 or 10. (The responses \({\boldsymbol Y}\) are also needed.)
For each base learner \(\Psi^l\), perform \(V\)-fold cross-validation to produce predictions.
This gives the level-one data set \({\boldsymbol Z}\), consisting of cross-validated predictions for all the level-zero observations - that is, a matrix with \(N\) rows and \(L\) columns.
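To make Step 1 concrete, here is a minimal sketch (not from the course material; the simulated data and the two base learners, lm and rpart, are chosen only for illustration) that produces the level-one matrix \({\boldsymbol Z}\) with \(V=5\) and \(L=2\).
Code
library(rpart)

set.seed(1)
N <- 200
x <- data.frame(x1 = rnorm(N), x2 = rnorm(N))
y <- 2 * x$x1 - x$x2 + rnorm(N)

V <- 5
fold <- sample(rep(1:V, length.out = N)) # random fold assignment
Z <- matrix(NA, nrow = N, ncol = 2, dimnames = list(NULL, c("lm", "rpart")))

df <- cbind(y = y, x)
for (v in 1:V) {
  train <- fold != v # fit on all folds except fold v
  fit1 <- lm(y ~ ., data = df[train, ]) # base learner 1: linear regression
  Z[!train, "lm"] <- predict(fit1, newdata = x[!train, ])
  fit2 <- rpart(y ~ ., data = df[train, ]) # base learner 2: regression tree
  Z[!train, "rpart"] <- predict(fit2, newdata = x[!train, ])
}
dim(Z) # N rows (cross-validated predictions), L columns (base learners)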
What could the base learners be?
“Any” method that produces a prediction - “all” types of problems.
- linear regression
- lasso
- CART
- random forest with mtry=value 1
- random forest with mtry=value 2
- xgboost with hyperparameter set 1
- xgboost with hyperparameter set 2
- neural net with hyperparameter set 1
Step 2: Fit the metalearner
- The starting point is the level-one prediction data \({\boldsymbol Z}\) together with the responses \((Y_1,\ldots ,Y_N)\).
- The metalearner is used to estimate the weights given to each base learner: \(\hat{\eta}_i=\alpha_1 z_{i1}+ \cdots + \alpha_L z_{iL}\).
What could the metalearner be?
- the mean (bagging)
- constructed by minimizing the
- squared loss (ordinary least squares) or
- non-negative least squares (most popular)
- ridge or lasso regression
- logistic regression (for binary classification)
- constructed by minimizing 1-ROC-AUC
(Class notes: Study Figure 3.2 from Polley, Rose, and Laan (2011) and/or Figure 1 from Laan, Polley, and Hubbard (2007))
The metalearning
Some observations
- The term discrete super learner is used if the base learner with the lowest risk (i.e. CV-error) is selected.
- Since the predictions from multiple base learners may be highly correlated, the chosen method should perform well in that case (e.g. ridge or lasso).
- When minimizing the squared loss it has been found that adding a non-negativity constraint \(\alpha_l\ge 0\) works well,
- and also the additivity constraint \(\sum_{l=1}^L \alpha_l=1\) - so the ensemble is a convex combination of the base learners (see the sketch after this list).
- Non-linear optimization methods may be employed for the metalearner if no existing algorithm is available
- Historically a regularized linear model has “mostly” been used
- For classification the logistic response function can be used on the linear combination of base learners (Figure 3.2 Polley, Rose, and Laan (2011)).
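A minimal sketch of the non-negative least squares metalearner with the convexity constraint (assuming the nnls package and the level-one matrix Z with responses y from the Step 1 sketch above): fit NNLS and rescale the weights to sum to one.
Code
library(nnls)

fit <- nnls(Z, y) # minimize ||y - Z alpha||^2 subject to alpha >= 0
alpha <- coef(fit) / sum(coef(fit)) # rescale so that sum(alpha) = 1
alpha
eta_hat <- Z %*% alpha # ensemble prediction: convex combination of base learners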
Examples
Simulation examples
(Class notes: Study Figure 3.3 and Table 3.2 from Polley, Rose, and Laan (2011))
Real data
(Class notes: Study Figure 3.4 and Table 3.3 from Polley, Rose, and Laan (2011). RE = MSE relative to the OLS linear model.)
Theoretical result
LeDell (2015) (page 6)
Oracle selector: the estimator, among all possible weighted combinations of the base prediction functions, that minimizes the risk under the true data-generating distribution.
The oracle result was established for the Super Learner by Laan, Polley, and Hubbard (2007)
If the true prediction function cannot be represented by a combination of the (available) base learners, then the super learner converges to the closest linear combination - the one that would be optimal if the true data-generating function were known.
The oracle result requires a uniformly bounded loss function. The convex restriction (\(\sum_{l=1}^L \alpha_l=1\)) implies that if each base learner is bounded, so is the convex combination. In practice, truncating the predicted values to the range of the outcome in the training set is sufficient to allow for unbounded loss functions.
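In R the truncation could look like this (a sketch; pred and y_train are hypothetical placeholders for predicted values and training responses).
Code
# Truncate predictions to the observed range of the training outcome
pred_trunc <- pmin(pmax(pred, min(y_train)), max(y_train))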
Uncertainty in the ensemble
(Class notes: Study “Road map” 2 from Polley, Rose, and Laan (2011))
Add an outer (external) cross-validation loop (with the super learner loop inside). Suggestion: use 20 folds, especially for small sample sizes.
Overfitting? Check if the super learner does as well as or better than any of the base learners in the ensemble.
Results using influence functions to estimate the variance of the Super Learner rely on asymptotic variances for \(V\)-fold cross-validation (see Ch 5.3 of LeDell (2015)).
Other issues
Many different implementations are available, and there has been much work on parallel processing and speed- and memory-efficient execution.
Super Learner implicitly can handle hyperparameter tuning by including the same base learner with different model parameter sets in the ensemble.
Speed and memory improvements for large data sets involve subsampling; the R subsemble package is one solution, the H2O package another.
R example from the SuperLearner package
Comment - this package is still in use, but the h2o superlearner might be easier to use.
Code is copied from the Guide to SuperLearner and the presentation follows this guide. The data used is the Boston housing dataset from MASS, but with the median value of a house dichotomized into a classification problem.
Observe that only 150 of the 506 observations are used (to speed things up, but of course that gives less accurate results).
Code
data(Boston, package = "MASS")
#colSums(is.na(Boston)) # no missing values
outcome = Boston$medv
# Create a dataframe to contain our explanatory variables.
data = subset(Boston, select = -medv)
# Set a seed for reproducibility in this random sampling.
set.seed(1)
# Reduce to a dataset of 150 observations to speed up model fitting.
train_obs = sample(nrow(data), 150)
# X is our training sample.
x_train = data[train_obs, ]
# Create a holdout set for evaluating model performance.
# Note: cross-validation is even better than a single holdout sample.
x_holdout = data[-train_obs, ]
# Create a binary outcome variable: towns in which median home value is > 22,000.
outcome_bin = as.numeric(outcome > 22)
y_train = outcome_bin[train_obs]
y_holdout = outcome_bin[-train_obs]
table(y_train, useNA = "ifany")
y_train
 0  1
92 58
Next we check the available wrapper functions and how they differ from their “original versions”.
Code
listWrappers()
[1] "SL.bartMachine" "SL.bayesglm" "SL.biglasso"
[4] "SL.caret" "SL.caret.rpart" "SL.cforest"
[7] "SL.earth" "SL.extraTrees" "SL.gam"
[10] "SL.gbm" "SL.glm" "SL.glm.interaction"
[13] "SL.glmnet" "SL.ipredbagg" "SL.kernelKnn"
[16] "SL.knn" "SL.ksvm" "SL.lda"
[19] "SL.leekasso" "SL.lm" "SL.loess"
[22] "SL.logreg" "SL.mean" "SL.nnet"
[25] "SL.nnls" "SL.polymars" "SL.qda"
[28] "SL.randomForest" "SL.ranger" "SL.ridge"
[31] "SL.rpart" "SL.rpartPrune" "SL.speedglm"
[34] "SL.speedlm" "SL.step" "SL.step.forward"
[37] "SL.step.interaction" "SL.stepAIC" "SL.svm"
[40] "SL.template" "SL.xgboost"
[1] "All"
[1] "screen.corP" "screen.corRank" "screen.glmnet"
[4] "screen.randomForest" "screen.SIS" "screen.template"
[7] "screen.ttest" "write.screen.template"
Code
# How does SL.glm differ from glm? obsWeights is added to make it easy to use
# the training folds in the CV, and predictions are returned for new observations
SL.glm
function (Y, X, newX, family, obsWeights, model = TRUE, ...)
{
if (is.matrix(X)) {
X = as.data.frame(X)
}
fit.glm <- glm(Y ~ ., data = X, family = family, weights = obsWeights,
model = model)
if (is.matrix(newX)) {
newX = as.data.frame(newX)
}
pred <- predict(fit.glm, newdata = newX, type = "response")
fit <- list(object = fit.glm)
class(fit) <- "SL.glm"
out <- list(pred = pred, fit = fit)
return(out)
}
<bytecode: 0x1218fb200>
<environment: namespace:SuperLearner>
Code
# lambda.min (not lambda.1se) is used by default; again obsWeights is added,
# and the model matrix is specified correctly if X is not already a matrix
SL.glmnet
function (Y, X, newX, family, obsWeights, id, alpha = 1, nfolds = 10,
nlambda = 100, useMin = TRUE, loss = "deviance", ...)
{
.SL.require("glmnet")
if (!is.matrix(X)) {
X <- model.matrix(~-1 + ., X)
newX <- model.matrix(~-1 + ., newX)
}
fitCV <- glmnet::cv.glmnet(x = X, y = Y, weights = obsWeights,
lambda = NULL, type.measure = loss, nfolds = nfolds,
family = family$family, alpha = alpha, nlambda = nlambda,
...)
pred <- predict(fitCV, newx = newX, type = "response", s = ifelse(useMin,
"lambda.min", "lambda.1se"))
fit <- list(object = fitCV, useMin = useMin)
class(fit) <- "SL.glmnet"
out <- list(pred = pred, fit = fit)
return(out)
}
<bytecode: 0x121958710>
<environment: namespace:SuperLearner>
First we fit only the lasso to check what is being done. The default metalearner is “method.NNLS” (both for regression and two-class classification - presumably then NNLS on the linear predictor?).
Code
set.seed(1)
sl_lasso = SuperLearner(Y = y_train, X = x_train, family = binomial(), SL.library = "SL.glmnet")
sl_lasso
Call:
SuperLearner(Y = y_train, X = x_train, family = binomial(), SL.library = "SL.glmnet")
Risk Coef
SL.glmnet_All 0.08484849 1
Code
#str(sl_lasso)
sl_lasso$cvRisk
SL.glmnet_All
0.08484849
Now we use the lasso and the random forest, and also add the mean of the \(y\)s just as a benchmark.
Code
set.seed(1)
sl = SuperLearner(Y = y_train, X = x_train, family = binomial(), SL.library = c("SL.mean", "SL.glmnet", "SL.randomForest"))
sl
Call:
SuperLearner(Y = y_train, X = x_train, family = binomial(), SL.library = c("SL.mean",
"SL.glmnet", "SL.randomForest"))
Risk Coef
SL.mean_All 0.23773937 0.000000
SL.glmnet_All 0.08725786 0.134252
SL.randomForest_All 0.07213058 0.865748
Code
sl$times$everything
user system elapsed
1.655 0.032 1.690
Our ensemble gives weight 0.13 to the lasso and 0.87 to the random forest. (The guide used a different implementation of the random forest, called ranger, and got 0.02 and 0.98.)
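To mimic the guide, one could swap SL.randomForest for the ranger wrapper (a sketch, not run here; assumes the ranger package is installed).
Code
set.seed(1)
# Same ensemble, but with the ranger implementation of random forest
sl_ranger = SuperLearner(Y = y_train, X = x_train, family = binomial(),
                         SL.library = c("SL.mean", "SL.glmnet", "SL.ranger"))
sl_ranger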
Predict on the part of the dataset not used for the training.
Code
pred = predict(sl, x_holdout = x_holdout, onlySL = TRUE)
str(pred)
List of 2
$ pred : num [1:150, 1] 0.3029 0.07 0.97847 0.00726 0.00523 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:150] "505" "324" "167" "129" ...
.. ..$ : NULL
$ library.predict: num [1:150, 1:3] 0.387 0.387 0.387 0.387 0.387 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:150] "505" "324" "167" "129" ...
.. ..$ : chr [1:3] "SL.mean_All" "SL.glmnet_All" "SL.randomForest_All"
Code
summary(pred$pred)
V1
Min. :0.0003034
1st Qu.:0.0183955
Median :0.1135270
Mean :0.3855066
3rd Qu.:0.9036164
Max. :0.9955802
Code
summary(pred$library.predict)
SL.mean_All SL.glmnet_All SL.randomForest_All
Min. :0.3867 Min. :0.0000014 Min. :0.0000
1st Qu.:0.3867 1st Qu.:0.0244935 1st Qu.:0.0160
Median :0.3867 Median :0.2063204 Median :0.1020
Mean :0.3867 Mean :0.3866667 Mean :0.3853
3rd Qu.:0.3867 3rd Qu.:0.8169726 3rd Qu.:0.9123
Max. :0.3867 Max. :0.9997871 Max. :0.9980
Now add an external cross-validation loop - only using the training data. The default \(V=10\) is used for the inner loop, and we set \(V=10\) for the outer loop as well.
Code
system.time({cv_sl = CV.SuperLearner(Y = y_train, X = x_train, V = 10, family = binomial(), SL.library = c("SL.mean", "SL.glmnet", "SL.randomForest"))})
user system elapsed
15.003 0.179 15.181
Code
summary(cv_sl)
Call:
CV.SuperLearner(Y = y_train, X = x_train, V = 10, family = binomial(), SL.library = c("SL.mean",
"SL.glmnet", "SL.randomForest"))
Risk is based on: Mean Squared Error
All risk estimates are based on V = 10
Algorithm Ave se Min Max
Super Learner 0.077041 0.0123412 0.0223396 0.12156
Discrete SL 0.079783 0.0132389 0.0245396 0.12151
SL.mean_All 0.242535 0.0093204 0.1947874 0.29619
SL.glmnet_All 0.088109 0.0152056 0.0098891 0.14402
SL.randomForest_All 0.073960 0.0115466 0.0245396 0.12151
See the guide for more information on running multiple versions of one base learner, and on parallelisation; a small taste of the former is given below.
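A sketch (not run here): the create.Learner() helper generates wrappers for several hyperparameter sets of one base learner, which can then be stacked; the mtry grid below is arbitrary.
Code
# Create one SL.ranger wrapper per mtry value (names: SL.ranger_1, SL.ranger_2)
learners = create.Learner("SL.ranger", tune = list(mtry = c(2, 5)))
learners$names

set.seed(1)
# Stack the tuned wrappers together with the mean and the lasso
sl_tuned = SuperLearner(Y = y_train, X = x_train, family = binomial(),
                        SL.library = c("SL.mean", "SL.glmnet", learners$names))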
R example from the h2o package
https://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/stacked-ensembles.html
Python examples available from the same page
The Higgs boson data is used - but which version is not specified; maybe this https://archive.ics.uci.edu/ml/datasets/HIGGS or a specially made data set. The problem is binary, so presumably the task is to detect signal vs. noise.
Default metalearner: Options include ‘AUTO’ (GLM with non-negative weights; if validation_frame is present, a lambda search is performed).
Code
h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 1 days 3 hours
H2O cluster timezone: Europe/Oslo
H2O data parsing timezone: UTC
H2O cluster version: 3.40.0.1
H2O cluster version age: 25 days
H2O cluster name: H2O_started_from_R_mettela_bze126
H2O cluster total nodes: 1
H2O cluster total memory: 3.01 GB
H2O cluster total cores: 10
H2O cluster allowed cores: 10
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
R Version: R version 4.2.2 (2022-10-31)
Code
# Import a sample binary outcome train/test set into H2O
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
Code
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
Code
# Identify predictors and response
<- "response"
y <- setdiff(names(train), y)
x
# For binary classification, response should be a factor
<- as.factor(train[, y])
train[, y] <- as.factor(test[, y])
test[, y]
print(dim(train))
[1] 10000 29
Code
print(colnames(train))
[1] "response" "x1" "x2" "x3" "x4" "x5"
[7] "x6" "x7" "x8" "x9" "x10" "x11"
[13] "x12" "x13" "x14" "x15" "x16" "x17"
[19] "x18" "x19" "x20" "x21" "x22" "x23"
[25] "x24" "x25" "x26" "x27" "x28"
Code
print(dim(test))
[1] 5000 29
Code
# Number of CV folds (to generate level-one data for stacking)
nfolds <- 5
# There are a few ways to assemble a list of models to stack together:
# 1. Train individual models and put them in a list
# Here we generate a 2-model ensemble (GBM + RF)
# Train & Cross-validate a GBM
my_gbm <- h2o.gbm(x = x,
                  y = y,
                  training_frame = train,
                  distribution = "bernoulli",
                  ntrees = 10,
                  max_depth = 3,
                  min_rows = 2,
                  learn_rate = 0.2,
                  nfolds = nfolds,
                  keep_cross_validation_predictions = TRUE,
                  seed = 1)
Code
# Train & Cross-validate a RF
my_rf <- h2o.randomForest(x = x,
                          y = y,
                          training_frame = train,
                          ntrees = 50,
                          nfolds = nfolds,
                          keep_cross_validation_predictions = TRUE,
                          seed = 1)
Now the default metalearner - AUTO: a GLM with non-negative weights.
Code
# Train a stacked ensemble using the GBM and RF above
ensemble <- h2o.stackedEnsemble(x = x,
                                y = y,
                                training_frame = train,
                                base_models = list(my_gbm, my_rf))
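The metalearner can also be chosen explicitly instead of AUTO (a sketch, not run here; metalearner_algorithm and metalearner_nfolds are arguments of h2o.stackedEnsemble).
Code
# Sketch: same ensemble, but with an explicitly requested GLM metalearner,
# itself tuned with 5-fold cross-validation
ensemble_glm <- h2o.stackedEnsemble(x = x,
                                    y = y,
                                    training_frame = train,
                                    base_models = list(my_gbm, my_rf),
                                    metalearner_algorithm = "glm",
                                    metalearner_nfolds = 5)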
Code
# default metalearner_transform should be NONE
#print(summary(ensemble))
#ensemble@model
# Eval ensemble performance on a test set
perf <- h2o.performance(ensemble, newdata = test)

# Compare to base learner performance on the test set
perf_gbm_test <- h2o.performance(my_gbm, newdata = test)
perf_rf_test <- h2o.performance(my_rf, newdata = test)
baselearner_best_auc_test <- max(h2o.auc(perf_gbm_test), h2o.auc(perf_rf_test))
ensemble_auc_test <- h2o.auc(perf)
print(sprintf("Best Base-learner Test AUC: %s", baselearner_best_auc_test))
[1] "Best Base-learner Test AUC: 0.769204725074508"
Code
print(sprintf("Ensemble Test AUC: %s", ensemble_auc_test))
[1] "Ensemble Test AUC: 0.773144298176816"
Code
# [1] "Best Base-learner Test AUC: 0.76979821502548"
# [1] "Ensemble Test AUC: 0.773501212640419"
# Generate predictions on a test set (if necessary)
pred <- h2o.predict(ensemble, newdata = test)
Code
print(head(pred))
predict p0 p1
1 0 0.6839178 0.3160822
2 1 0.5825428 0.4174572
3 1 0.5869431 0.4130569
4 1 0.1927212 0.8072788
5 1 0.4509384 0.5490616
6 1 0.3275080 0.6724920
Code
ensemble@model$metalearner_model

Model Details:
==============

H2OBinomialModel: glm
Model ID: metalearner_AUTO_StackedEnsemble_model_R_1677945156774_1824

GLM Model: summary
    family  link                                regularization number_of_predictors_total number_of_active_predictors number_of_iterations
1 binomial logit Elastic Net (alpha = 0.5, lambda = 8.399E-5 )                          2                           2                    3
                                                 training_frame
1 levelone_training_StackedEnsemble_model_R_1677945156774_1824

Coefficients: glm coefficients
                           names coefficients standardized_coefficients
1                      Intercept    -3.603549                  0.149102
2 GBM_model_R_1677945156774_1086     3.298011                  0.493334
3 DRF_model_R_1677945156774_1214     3.809905                  0.701246
Code
# Train a stacked ensemble using the GBM and RF above, now with a logit
# transform of the base learner predictions
ensemble <- h2o.stackedEnsemble(x = x,
                                y = y,
                                training_frame = train,
                                base_models = list(my_gbm, my_rf),
                                metalearner_transform = "Logit")
Code
#print(summary(ensemble))
#print(ensemble@model)
# Eval ensemble performance on a test set
perf <- h2o.performance(ensemble, newdata = test)

# Compare to base learner performance on the test set
perf_gbm_test <- h2o.performance(my_gbm, newdata = test)
perf_rf_test <- h2o.performance(my_rf, newdata = test)
baselearner_best_auc_test <- max(h2o.auc(perf_gbm_test), h2o.auc(perf_rf_test))
ensemble_auc_test <- h2o.auc(perf)
print(sprintf("Best Base-learner Test AUC: %s", baselearner_best_auc_test))
[1] "Best Base-learner Test AUC: 0.769204725074508"
Code
print(sprintf("Ensemble Test AUC: %s", ensemble_auc_test))
[1] "Ensemble Test AUC: 0.773096033881535"
Code
# Generate predictions on a test set (if necessary)
pred <- h2o.predict(ensemble, newdata = test)
Code
print(head(pred))
predict p0 p1
1 0 0.6739209 0.3260791
2 1 0.5814741 0.4185259
3 1 0.5826643 0.4173357
4 1 0.1971804 0.8028196
5 1 0.4561659 0.5438341
6 1 0.3365841 0.6634159
Code
ensemble@model$metalearner_model

Model Details:
==============

H2OBinomialModel: glm
Model ID: metalearner_AUTO_StackedEnsemble_model_R_1677945156774_1830

GLM Model: summary
    family  link                                regularization number_of_predictors_total number_of_active_predictors number_of_iterations
1 binomial logit Elastic Net (alpha = 0.5, lambda = 3.885E-4 )                          2                           2                    3
                                                 training_frame
1 levelone_training_StackedEnsemble_model_R_1677945156774_1830

Coefficients: glm coefficients
                           names coefficients standardized_coefficients
1                      Intercept    -0.053725                  0.154528
2 GBM_model_R_1677945156774_1086     0.791767                  0.515081
3 DRF_model_R_1677945156774_1214     0.845217                  0.731991