I recently read an interesting paper on Bagging [1]. The researchers compared Bagging and Random Subspaces (RS) with Random Forest (RF) and found that by combining Bagging with RS, one can achieve performance comparable to that of RF. They called their algorithm SubBag.
Luckily, scikit-learn implements Bagging the way it was done in [1]. We can therefore reproduce their experiment.
For our experiment we need scikit-learn’s classes `BaggingClassifier` and `BaggingRegressor`. The relevant parameters are `bootstrap`, `bootstrap_features`, `max_samples`, and `max_features`. Setting `bootstrap=True` and `max_features=0.75` will give us SubBag.
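As a sketch (assuming SubBag means bootstrap samples plus a random 75% feature subspace, i.e. `max_features=0.75` in scikit-learn; `load_wine` is just an example dataset, and the base estimator is left at scikit-learn’s default decision tree):

```python
# Sketch: configuring scikit-learn's BaggingClassifier as SubBag.
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

subbag = BaggingClassifier(
    bootstrap=True,     # draw training samples with replacement
    max_features=0.75,  # each tree sees a random 75% of the features
    random_state=0,
)
scores = cross_val_score(subbag, X, y, cv=5)
print(scores.mean())
```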
To understand what scikit-learn does internally, one can look at the private function `_generate_indices`. For example, if we set `max_samples=0.632`, scikit-learn will draw `int(0.632 * n_samples)` indices from the training set. With `bootstrap=True`, these indices are drawn with replacement, so not all samples in a bootstrap sample will be unique. We now want to set `max_samples` to the average proportion of unique samples in a bootstrap sample.
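The idea can be mimicked with a simplified re-implementation (an assumption-laden sketch of the mechanism, not scikit-learn’s actual private code):

```python
# Simplified sketch of bootstrap index generation.
import numpy as np

def generate_indices(rng, bootstrap, n_population, n_samples):
    """Draw n_samples indices from range(n_population)."""
    if bootstrap:
        # with replacement: duplicates are possible
        return rng.randint(0, n_population, n_samples)
    # without replacement: all indices unique
    return rng.permutation(n_population)[:n_samples]

rng = np.random.RandomState(0)
idx = generate_indices(rng, bootstrap=True, n_population=1000,
                       n_samples=int(0.632 * 1000))
print(len(idx), len(np.unique(idx)))  # 632 drawn, fewer unique
```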
According to [2], the advantage of setting this value is that possible outliers in the training set sometimes do not show up in the bootstrap sample.
Let us first derive the value 0.632 with some probability theory. We start by calculating the expected number of unique samples in $n$ draws with replacement from $n$ samples.
A fixed sample is missed in a single draw with probability $1 - \frac{1}{n}$, so it never appears in $n$ draws with probability $\left(1 - \frac{1}{n}\right)^n$. Then the expected number of unique samples can be calculated by summing the indicators $X_i = \mathbb{1}\{\text{sample } i \text{ appears at least once}\}$, i.e.

$$\mathbb{E}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} P(X_i = 1) = n\left(1 - \left(1 - \frac{1}{n}\right)^n\right).$$
Next, we divide the expected number of unique samples by the total number of samples $n$ to obtain the ratio of unique samples:

$$\frac{1}{n} \cdot n\left(1 - \left(1 - \frac{1}{n}\right)^n\right) = 1 - \left(1 - \frac{1}{n}\right)^n.$$
And taking the limit yields

$$\lim_{n \to \infty} \left(1 - \left(1 - \frac{1}{n}\right)^n\right) = 1 - e^{-1} \approx 0.632.$$
Alternatively, let $Y_i$ be the number of times sample $i$ is drawn, and assume $\lambda = np = n \cdot \frac{1}{n} = 1$. Then by the Poisson limit theorem,

$$P(Y_i = k) = \binom{n}{k} p^k (1 - p)^{n - k} \longrightarrow \frac{\lambda^k e^{-\lambda}}{k!} = \frac{e^{-1}}{k!}.$$

We have obtained the probability of drawing sample $i$ exactly $k$ times in $n$ draws. Thus,

$$P(Y_i \geq 1) = 1 - P(Y_i = 0) = 1 - e^{-1} \approx 0.632,$$

and we get the same result as before.
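The derivation can be checked numerically with a few lines (a small sketch of my own):

```python
# Check numerically: the ratio of unique samples 1 - (1 - 1/n)**n
# approaches 1 - 1/e ~= 0.632 as n grows.
import math

for n in (10, 100, 10_000):
    ratio = 1 - (1 - 1 / n) ** n
    print(n, round(ratio, 4))
print(round(1 - math.exp(-1), 4))  # the limit, ~0.6321
```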
Plotting the ratio of unique samples is another possibility to derive this number.
And the plot shows us again $1 - e^{-1} \approx 0.632$.
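Such a plot can be produced with a short script (a sketch assuming matplotlib is available; rendered off-screen here):

```python
# Plot the ratio of unique samples as a function of n; the curve
# converges to the horizontal line 1 - 1/e.
import math
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

n = np.arange(1, 500)
ratio = 1 - (1 - 1 / n) ** n

plt.plot(n, ratio, label=r"$1-(1-1/n)^n$")
plt.axhline(1 - math.exp(-1), linestyle="--",
            label=r"$1-e^{-1}\approx 0.632$")
plt.xlabel("n")
plt.ylabel("ratio of unique samples")
plt.legend()
plt.savefig("unique_ratio.png")
```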
What happens if we set `max_samples=0.632`? Let us plot the ratio of unique samples for various values of `max_samples`. If we set `max_samples=0.632`, about 74% of the samples in a bootstrap sample will be unique, since $\frac{1 - e^{-0.632}}{0.632} \approx 0.741$. The lower the value of `max_samples`, the higher the proportion of unique samples.
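A quick Monte-Carlo check of this claim (my own sketch; I read “ratio of unique samples” as the fraction of distinct indices within the drawn bootstrap sample):

```python
# For max_samples = alpha, draw m = alpha * n indices with replacement
# and compare the observed unique fraction with (1 - e^{-alpha}) / alpha.
import numpy as np

rng = np.random.RandomState(0)
n = 10_000
for alpha in (0.2, 0.632, 1.0):
    m = int(alpha * n)
    draws = rng.randint(0, n, m)
    unique_frac = len(np.unique(draws)) / m
    analytic = (1 - np.exp(-alpha)) / alpha
    print(alpha, round(unique_frac, 3), round(analytic, 3))
```

The unique fraction decreases as `max_samples` grows, matching the analytic curve.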
I chose the following datasets for the tests (all included in scikit-learn; the shapes are (n_samples, n_features)):
- load_boston (506, 13)
- load_diabetes (442, 10)
- load_iris (150, 4)
- load_digits (1797, 64)
- load_wine (178, 13)
- load_breast_cancer (569, 30)
Besides setting a `random_state`, I did not change any parameters. Training was done with 5-fold cross-validation.
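The experiment loop can be sketched as follows (my own reconstruction, shown on one example classification dataset; model names match the table columns below):

```python
# 5-fold CV comparing RF with the Bagging variants; apart from
# random_state, all parameters are left at scikit-learn's defaults.
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

models = {
    "RF": RandomForestClassifier(random_state=0),
    "bootstrap, max_samples=0.632": BaggingClassifier(
        bootstrap=True, max_samples=0.632, random_state=0),
    "SubBag": BaggingClassifier(
        bootstrap=True, max_features=0.75, random_state=0),
    "bootstrap": BaggingClassifier(bootstrap=True, random_state=0),
    "bootstrap, bootstrap_features": BaggingClassifier(
        bootstrap=True, bootstrap_features=True, random_state=0),
}

results = {}
for name, model in models.items():
    results[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {results[name]:.3f}")
```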
| Dataset | RF | bootstrap, max_samples=0.632 | SubBag | bootstrap | bootstrap, bootstrap_features |
|---|---|---|---|---|---|
According to the table, the best models are:
- Random forest
Maybe SubBag performed so badly because we set `max_features=0.75`. Furthermore, I think the datasets were too small. With a greater number of samples $n$, `max_samples=0.632` would have a different effect.
If we look at the individual folds instead, the performance of the models seems more similar.
The difference between RF and `bootstrap_features=True` is that RF draws subsets of features locally, at each split of a tree, whereas `bootstrap_features=True` selects the features globally, once before the construction of each tree.
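The contrast in code (my reading of the two mechanisms, with example parameter values):

```python
# RF: each tree re-draws a fresh random feature subset at every split.
# Bagging: each tree is grown on one feature subset, drawn once (here
# with replacement) before the tree is built.
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier

X, y = load_wine(return_X_y=True)

rf = RandomForestClassifier(max_features="sqrt", random_state=0)

bag = BaggingClassifier(bootstrap_features=True, random_state=0)
bag.fit(X, y)
# each fitted tree records the single feature subset it was grown on
print(bag.estimators_features_[0])
```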
In general, I think it’s a good idea to set `bootstrap_features=True`. The best value for `max_features` depends on the dataset. `max_samples=0.632` will not always achieve good results; it is perhaps better to keep the value at its default of 1.0.
[1] P. Panov and S. Džeroski, “Combining Bagging and Random Subspaces to Create Better Ensembles,” Advances in Intelligent Data Analysis VII, pp. 118–129, 2007.
[2] G. Louppe and P. Geurts, “Ensembles on Random Patches,” Lecture Notes in Computer Science, pp. 346–361, 2012.