Comparing Random Forest and Bagging

3 minute read

I recently read an interesting paper on Bagging [1]. The researchers compared Bagging and Random Subspace (RS) with Random Forest (RF). Their results were that by combining Bagging with RS, one can achieve a comparable performance to that of RF. They called their algorithm SubBag.

Luckily, scikit-learn implements Bagging the way it was done in [2]. We can therefore reproduce their experiment.

Bagging

For our experiment we need scikit-learn’s classes BaggingClassifier and BaggingRegressor. The relevant parameters are: max_samples, bootstrap and max_features, bootstrap_features.

bootstrap=True, max_samples=0.632, bootstrap_features=True and max_samples=0.75 will give us SubBag.

To understand what scikit-learn does internally, one can look at the function _generate_indices [3]. For example, if we set bootstrap=True and max_samples=0.632, scikit-learn will call random_state.randint with n_population=n and n_samples=.

Why max_samples=0.632?

By setting bootstrap=True, we draw samples with replacement. Hence, not all samples will be unique in a bootstrap sample. We now want to set max_samples to the average proportion of unique samples in a bootstrap sample.

According to [1], the advantage of setting this value is that possible outliers in the training set sometimes do not show up in the bootstrap sample.

Derivation 1

Let us first derive the value 0.632 with some probability theory. We start by calculating the expected number of unique samples in draws with replacement.

Then the expected number can be calculated by summing i.e.

Next, we divide the number of unique samples by the total number of samples to obtain the ratio of unique samples.

And taking the limit, yields

Derivation 2

Assume . Then by the Poisson limit theorem,

We have obtained the probability to get times sample in draws. Thus,

and we get the same result as before.

Derivation 3

Plotting the ratio of unique samples is another possibility to derive this number.

import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

s = np.random.RandomState(1)

unique = []
for i in range(10, 10000):
    a = s.randint(0, i, int(i))
    unique.append(len(set(a)) / len(a))

sns.set()
plt.plot(unique)
plt.ylabel("unique")
plt.savefig("plot.svg", format="svg")

And the plot shows us again .

Plot

What happens if we set max_samples=0.632?

Let us plot the ratio of unique samples with various values of max_samples.

Plot

If we set max_samples=0.632, about samples will be unique in a bootstrap sample. The lower the value of max_samples, the higher the probability of unique samples.

Tests

I chose the following datasets for the tests (all included in scikit-learn):

  • load_boston (506, 13)
  • load_diabetes (442, 10)
  • load_iris (150, 4)
  • load_digits (1797, 64)
  • load_wine (178, 13)
  • load_breast_cancer (569, 30)

Besides setting a random_state, I did not change any parameters. Training was done with a 5-fold cross-validation.

Dataset RF bootstrap, max_samples=0.632 SubBag bootstrap bootstrap, bootstrap_features
boston 0.61 0.65 0.57 0.62 0.60
diabetes 0.37 0.39 0.36 0.37 0.40
iris 0.95 0.95 0.94 0.95 0.95
digits 0.90 0.88 0.89 0.88 0.90
wine 0.97 0.93 0.95 0.93 0.97
breast_cancer 0.95 0.95 0.95 0.93 0.95

According to the table, the best models are:

  1. bootstrap=True, bootstrap_features=True, max_samples=1.0, max_features=1.0
  2. Random forest
  3. bootstrap=True, max_samples=0.632, bootstrap_features=False

Maybe SubBag performed so badly because we set max_features=0.75. Furthermore, I think the datasets were too small. With greater , max_samples=0.632 would have a different effect.

If we only look at each fold, the performance seems to be more similar.

Plot

Discussion

The difference between RF and bootstrap_features=True is that RF draws locally subsets of features. bootstrap_features=True selects the features globally prior to the construction of the tree.

In the tests shown here RF was outperformed by global selection of features. Other papers like [2] show similar results. According to [2], picking features globally can be better for larger datasets.

In general, I think it’s a good idea to set bootstrap=True and bootstrap_features=True. The best values for max_samples and max_features depend on the dataset. max_samples=0.632 will not always achieve good results. It is perhaps better to keep the value at .

References

[1] P. Panov and S. Džeroski, “Combining Bagging and Random Subspaces to Create Better Ensembles,” Advances in Intelligent Data Analysis VII, pp. 118–129, 2007.

[2] G. Louppe and P. Geurts, “Ensembles on Random Patches,” Lecture Notes in Computer Science, pp. 346–361, 2012.

[3] https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/ensemble/bagging.py#L40

Comments