H2O Ensemble: Stacking in H2O
Translator's note: if you cannot get this version installed, don't get stuck on it; you can move on to the second translated article instead, but I suggest skimming this one first.
H2O Ensemble has been implemented as a stand-alone R package called h2oEnsemble. The package extends the h2o R package and allows users to train an ensemble on an H2O cluster using any of H2O's supervised learning algorithms. As with the h2o R package, all of the actual computation in h2oEnsemble is performed inside the H2O cluster, rather than in R memory.
The main computational tasks in the Super Learner ensemble algorithm are the training and cross-validation of the base learners and the metalearner. Implementing the ensemble "plumbing" in R (rather than in Java) therefore incurs no performance penalty: all of the training and data processing are performed in the high-performance H2O cluster.
H2O Ensemble currently supports regression and binary classification only; multi-class support will be added in a future release.
Translator's note: the code below throws errors with the latest version of the h2o package, so installing an older version is recommended. The installation command for the older version is given below; it may be slow (the h2o package is about 50 MB), and the h2o package requires a Java runtime environment.
install.packages("h2o", type="source", repos=(c("http://h2o-release.s3.amazonaws.com/h2o/rel-turchin/9/R")))
Installing H2O Ensemble
To install the h2oEnsemble package, just follow the installation instructions in the package README, reproduced here for convenience.
H2O R Package
First, install the H2O R package if you do not already have it. R installation instructions can be found at: http://h2o.ai/download
H2O Ensemble R Package
The recommended way to install the h2oEnsemble R package is directly from GitHub using the devtools package. (H2O World tutorial attendees can install the package from the provided USB stick.)
Install from GitHub:
library(devtools)
install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")
Higgs Demo
This is an example of binary classification using the h2o.ensemble function from the h2oEnsemble package. The demo uses a subset of the HIGGS dataset, which has 28 numeric features and a binary response. The machine learning task is to distinguish signal events that produce a Higgs boson (Y = 1) from background events that do not (Y = 0). The dataset contains roughly equal numbers of positive and negative examples; in other words, it is class-balanced.
If running from plain R, execute R in the directory of this script. If running from RStudio, be sure to setwd() to the location of this script. h2o.init() starts H2O in R's current working directory, and h2o.importFile() is the h2o file import function.
Start the H2O cluster
library(h2oEnsemble) # This will load the `h2o` R package as well
h2o.init(nthreads = -1, enable_assertions = FALSE) # Start an H2O cluster with nthreads = num cores on your machine; -1 means use all CPUs on the host
h2o.removeAll() # (Optional) Remove all objects in H2O cluster
Import the data
First, import the training and test sets:
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_5k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
y <- "response"
x <- setdiff(names(train), y)
family <- "binomial"
For binary classification, the response variable should be a factor (an enum in Java, a categorical in Python pandas). The column types can be specified in the h2o.importFile call, or the response column can be converted afterward as follows:
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])
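Alternatively, the column types can be set at import time via the col.types argument of h2o.importFile. A minimal sketch, assuming the response is the first of the 29 columns in this HIGGS subset:
# Sketch: declare the response column as a factor ("enum") at import time
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_5k.csv",
                        col.types = c("enum", rep("numeric", 28)))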
Specify the base learners and metalearner
Here we will use the default base learner library for h2o.ensemble, which includes the default versions of GLM, Random Forest, GBM, and a Deep Neural Net (all using default model parameters). We will also use the default metalearner, the H2O GLM.
learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
"h2o.gbm.wrapper", "h2o.deeplearning.wrapper")
metalearner <- "h2o.glm.wrapper"
Train an ensemble
Train the ensemble, using 5-fold cross-validation to generate the level-one data. Note that more CV folds will take longer to train, but may increase performance.
fit <- h2o.ensemble(x = x, y = y,
training_frame = train,
family = family,
learner = learner,
metalearner = metalearner,
cvControl = list(V = 5))
Evaluate model performance
Since the response is binary, we can use Area Under the ROC Curve (AUC) to evaluate model performance. Compute test set performance, sorted by AUC (the default metric for binomial classification):
perf <- h2o.ensemble_performance(fit, newdata = test)
Print the performance of each base learner together with the ensemble performance:
> perf
Base learner performance, sorted by specified metric:
learner AUC
1 h2o.glm.wrapper 0.6824304
4 h2o.deeplearning.wrapper 0.7006335
2 h2o.randomForest.wrapper 0.7570211
3 h2o.gbm.wrapper 0.7780807
H2O Ensemble Performance on <newdata>:
----------------
Family: binomial
Ensemble performance (AUC): 0.781580655670451
We can compare the performance of the ensemble to the performance of the individual learners within it.
The best individual model here is the GBM, with a test set AUC of 0.778, compared to 0.7816 for the ensemble. At first glance this may not seem like much of a gain, but in many industries, such as medicine or finance, even this small advantage can be extremely valuable.
To improve the performance of the ensemble, we have a few options (a sketch combining them follows this list):
- Increase the number of cross-validation folds via the cvControl argument.
- Change the base learner library or the metalearner.
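A sketch of both options combined; the settings below are illustrative, not tuned:
# Sketch: 10-fold CV and a non-default (deep learning) metalearner
fit <- h2o.ensemble(x = x, y = y,
                    training_frame = train,
                    family = family,
                    learner = learner,
                    metalearner = "h2o.deeplearning.wrapper",  # swap in a different metalearner
                    cvControl = list(V = 10))                  # more CV folds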
Note that the ensemble results above are not reproducible, since h2o.deeplearning is not reproducible when run on multiple cores, and we did not set a seed for h2o.randomForest.wrapper.
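If reproducibility matters, one option is to wrap the learners so that every model gets a fixed seed, and to force single-threaded deep learning via its reproducible argument (slower, but deterministic). A sketch, assuming the wrappers forward these arguments to the underlying h2o functions:
# Sketch: seeded wrappers for (more) reproducible ensemble results
h2o.randomForest.seeded <- function(..., seed = 1)
  h2o.randomForest.wrapper(..., seed = seed)
h2o.deeplearning.repro <- function(..., reproducible = TRUE, seed = 1)
  h2o.deeplearning.wrapper(..., reproducible = reproducible, seed = seed)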
If you want to evaluate the model with a different metric, such as "MSE", you can pass that metric to the print method:
> print(perf, metric = "MSE")
Base learner performance, sorted by specified metric:
learner MSE
4 h2o.deeplearning.wrapper 0.2305775
1 h2o.glm.wrapper 0.2225176
2 h2o.randomForest.wrapper 0.2014339
3 h2o.gbm.wrapper 0.1916273
H2O Ensemble Performance on <newdata>:
----------------
Family: binomial
Ensemble performance (MSE): 0.1898735479034431
Predict
If you need to generate predictions (rather than just evaluate model performance), you can use the predict function on the test set.
pred <- predict(fit, newdata = test)
If you need the predictions back in R memory for further processing, convert pred to a local R data frame as follows:
predictions <- as.data.frame(pred$pred)[,3] #third column is P(Y==1)
labels <- as.data.frame(test[,y])[,1]
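With the predictions and labels in local R memory, the test set AUC can be recomputed in plain R, for example with the cvAUC package (assumed to be installed here):
library(cvAUC)  # assumption: cvAUC is installed
cvAUC::AUC(predictions = predictions, labels = labels)  # test set AUC of the ensemble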
The predict method for an h2o.ensemble fit returns a list with two objects: pred$pred contains the ensemble predictions, and pred$basepred is a matrix containing the predictions from each of the base learners. In this example we used four base learners, so the pred$basepred matrix has four columns.
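A quick sanity check of this structure:
# Pull the first few rows of the base learner predictions into R;
# expect one column per base learner (four in this example)
head(as.data.frame(pred$basepred))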
Specifying new learners
Now let's try again with a more extensive set of base learners. The h2oEnsemble package ships with four default wrapper functions, which you can customize to use non-default model parameters.
Here is an example of how to generate a custom learner library:
h2o.glm.1 <- function(..., alpha = 0.0) h2o.glm.wrapper(..., alpha = alpha)
h2o.glm.2 <- function(..., alpha = 0.5) h2o.glm.wrapper(..., alpha = alpha)
h2o.glm.3 <- function(..., alpha = 1.0) h2o.glm.wrapper(..., alpha = alpha)
h2o.randomForest.1 <- function(..., ntrees = 200, nbins = 50, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, nbins = nbins, seed = seed)
h2o.randomForest.2 <- function(..., ntrees = 200, sample_rate = 0.75, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, sample_rate = sample_rate, seed = seed)
h2o.randomForest.3 <- function(..., ntrees = 200, sample_rate = 0.85, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, sample_rate = sample_rate, seed = seed)
h2o.randomForest.4 <- function(..., ntrees = 200, nbins = 50, balance_classes = TRUE, seed = 1) h2o.randomForest.wrapper(..., ntrees = ntrees, nbins = nbins, balance_classes = balance_classes, seed = seed)
h2o.gbm.1 <- function(..., ntrees = 100, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, seed = seed)
h2o.gbm.2 <- function(..., ntrees = 100, nbins = 50, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, nbins = nbins, seed = seed)
h2o.gbm.3 <- function(..., ntrees = 100, max_depth = 10, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, max_depth = max_depth, seed = seed)
h2o.gbm.4 <- function(..., ntrees = 100, col_sample_rate = 0.8, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, col_sample_rate = col_sample_rate, seed = seed)
h2o.gbm.5 <- function(..., ntrees = 100, col_sample_rate = 0.7, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, col_sample_rate = col_sample_rate, seed = seed)
h2o.gbm.6 <- function(..., ntrees = 100, col_sample_rate = 0.6, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, col_sample_rate = col_sample_rate, seed = seed)
h2o.gbm.7 <- function(..., ntrees = 100, balance_classes = TRUE, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, balance_classes = balance_classes, seed = seed)
h2o.gbm.8 <- function(..., ntrees = 100, max_depth = 3, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, max_depth = max_depth, seed = seed)
h2o.deeplearning.1 <- function(..., hidden = c(500,500), activation = "Rectifier", epochs = 50, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.2 <- function(..., hidden = c(200,200,200), activation = "Tanh", epochs = 50, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.3 <- function(..., hidden = c(500,500), activation = "RectifierWithDropout", epochs = 50, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.4 <- function(..., hidden = c(500,500), activation = "Rectifier", epochs = 50, balance_classes = TRUE, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, balance_classes = balance_classes, seed = seed)
h2o.deeplearning.5 <- function(..., hidden = c(100,100,100), activation = "Rectifier", epochs = 50, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.6 <- function(..., hidden = c(50,50), activation = "Rectifier", epochs = 50, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
h2o.deeplearning.7 <- function(..., hidden = c(100,100), activation = "Rectifier", epochs = 50, seed = 1) h2o.deeplearning.wrapper(..., hidden = hidden, activation = activation, epochs = epochs, seed = seed)
Let's grab a subset of these learners for our base learner library and re-train the ensemble.
Customized base learner library:
learner <- c("h2o.glm.wrapper",
"h2o.randomForest.1", "h2o.randomForest.2",
"h2o.gbm.1", "h2o.gbm.6", "h2o.gbm.8",
"h2o.deeplearning.1", "h2o.deeplearning.6", "h2o.deeplearning.7")
Train with the new base learner library:
fit <- h2o.ensemble(x = x, y = y,
training_frame = train,
family = family,
learner = learner,
metalearner = metalearner,
cvControl = list(V = 5))
Evaluate the test set performance:
perf <- h2o.ensemble_performance(fit, newdata = test)
The results:
> perf
Base learner performance, sorted by specified metric:
learner AUC
1 h2o.glm.wrapper 0.6824304
7 h2o.deeplearning.1 0.6897187
8 h2o.deeplearning.6 0.6998472
9 h2o.deeplearning.7 0.7048874
2 h2o.randomForest.1 0.7668024
3 h2o.randomForest.2 0.7697849
4 h2o.gbm.1 0.7751240
6 h2o.gbm.8 0.7752852
5 h2o.gbm.6 0.7771115
H2O Ensemble Performance on <newdata>:
----------------
Family: binomial
Ensemble performance (AUC): 0.780924502576107
So what happens to the ensemble if we remove some of the weaker learners? Let's remove the GLM and deep learning models from the learner library and see what happens.
learner <- c("h2o.randomForest.1", "h2o.randomForest.2",
"h2o.gbm.1", "h2o.gbm.6", "h2o.gbm.8")
Re-train the ensemble and evaluate the performance again:
fit <- h2o.ensemble(x = x, y = y,
training_frame = train,
family = family,
learner = learner,
metalearner = metalearner,
cvControl = list(V = 5))
perf <- h2o.ensemble_performance(fit, newdata = test)
We actually lose ensemble performance by removing the weak learners! This demonstrates the value of stacking with a large and diverse set of base learners.
> perf
Base learner performance, sorted by specified metric:
learner AUC
1 h2o.randomForest.1 0.7668024
2 h2o.randomForest.2 0.7697849
3 h2o.gbm.1 0.7751240
5 h2o.gbm.8 0.7752852
4 h2o.gbm.6 0.7771115
H2O Ensemble Performance on <newdata>:
----------------
Family: binomial
Ensemble performance (AUC): 0.778853964308554
At first you might assume that removing the lower-performing models would improve the performance of the ensemble. However, each learner makes its own unique contribution to the ensemble, and diversity among the learners usually improves performance. The stacking algorithm learns the optimal way to combine all of the learners, in a way that is superior to other combination methods.
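One way to see this in practice is to inspect the metalearner itself. Assuming the fitted GLM metalearner is stored in fit$metafit (as in recent h2oEnsemble versions) and that your h2o version provides h2o.coef, its coefficients show how much weight each base learner receives in the combination:
# Sketch: inspect how the GLM metalearner weights the base learners
# (assumes fit$metafit holds the fitted H2O GLM metalearner)
h2o.coef(fit$metafit)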
Stacking an existing set of models
(Figure: schematic diagram of the stacking procedure.)
You can also start from a set of existing H2O models and stack them together with a specified metalearner using the h2o.stack() function.
The base models must have been trained on the same dataset with the same response variable, and they must have used the same folds for cross-validation.
An example follows. As above, start up the H2O cluster and load the training and test data.
library(h2oEnsemble)
h2o.init(nthreads = -1) # Start H2O cluster using all available CPU threads
# Import a sample binary outcome train/test set into R
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_5k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
y <- "response"
x <- setdiff(names(train), y)
family <- "binomial"
#For binary classification, response should be a factor
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])
Cross-validate and train a handful of base learners, then use the h2o.stack() function to create the ensemble:
# The h2o.stack function is an alternative to the h2o.ensemble function, which
# allows the user to specify H2O models individually and then stack them together
# at a later time. Saved models, re-loaded from disk, can also be stacked.
# The base models must use identical cv folds; this can be achieved in two ways:
# 1. specify the folds explicitly using the fold_column argument, or
# 2. use the same value for `nfolds` and set `fold_assignment = "Modulo"`
nfolds <- 5
glm1 <- h2o.glm(x = x, y = y, family = family,
training_frame = train,
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE)
gbm1 <- h2o.gbm(x = x, y = y, distribution = "bernoulli",
training_frame = train,
seed = 1,
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE)
rf1 <- h2o.randomForest(x = x, y = y, # distribution not used for RF
training_frame = train,
seed = 1,
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE)
dl1 <- h2o.deeplearning(x = x, y = y, distribution = "bernoulli",
training_frame = train,
nfolds = nfolds,
fold_assignment = "Modulo",
keep_cross_validation_predictions = TRUE)
models <- list(glm1, gbm1, rf1, dl1)
metalearner <- "h2o.glm.wrapper"
stack <- h2o.stack(models = models,
response_frame = train[,y],
metalearner = metalearner,
seed = 1,
keep_levelone_data = TRUE)
# Compute test set performance:
perf <- h2o.ensemble_performance(stack, newdata = test)
Print the test set performance of the base learners and the ensemble:
> print(perf)
Base learner performance, sorted by specified metric:
learner AUC
1 GLM_model_R_1480128759162_16643 0.6822933
4 DeepLearning_model_R_1480128759162_18909 0.7016809
3 DRF_model_R_1480128759162_17790 0.7546005
2 GBM_model_R_1480128759162_16661 0.7780807
H2O Ensemble Performance on <newdata>:
----------------
Family: binomial
Ensemble performance (AUC): 0.781241759877087
Roadmap for H2O Ensemble
H2O Ensemble is currently only available via the R API; however, it will be accessible through all of our APIs in a future release.
Update: ensembles have since been implemented as a model class in the H2O Java core, "H2OStackedEnsembleEstimator".
The code can be found on the ensembles branch of h2o-3, and the R and Python APIs are coming soon. See the h2o.stackedEnsemble function in h2o.
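For reference, a minimal sketch of that newer API, assuming an h2o build that ships h2o.stackedEnsemble and re-using the cross-validated base models from the h2o.stack example above:
# Sketch: stack previously cross-validated base models with the core API
stack2 <- h2o.stackedEnsemble(x = x, y = y,
                              training_frame = train,
                              base_models = list(glm1, gbm1, rf1, dl1))
h2o.performance(stack2, newdata = test)  # evaluate on the test set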
Shut down H2O
h2o.shutdown()
The H2O slides that accompany this tutorial can be found here.
The GitHub page for ensembles is here.