# Comparing Random Forest and Bagging

I recently read an interesting paper on *Bagging* [1]. The researchers compared *Bagging* and *Random Subspace* (RS) with *Random Forest* (RF). They found that combining *Bagging* with RS achieves performance comparable to that of RF. They called their algorithm *SubBag*.

Luckily, scikit-learn implements *Bagging* the way it was done in [2]. We can therefore reproduce their experiment.

## Bagging

For our experiment we need scikit-learn’s classes `BaggingClassifier` and `BaggingRegressor`. The relevant parameters are `max_samples`, `bootstrap`, `max_features` and `bootstrap_features`.

`bootstrap=True`, `max_samples=0.632`, `bootstrap_features=True` and `max_features=0.75` will give us *SubBag*.
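Putting these parameters together, a minimal SubBag sketch looks as follows; `n_estimators=100` and `random_state=0` are my own choices, not values from the paper:

```python
from sklearn.ensemble import BaggingClassifier

# SubBag = bagging of the samples combined with a random subspace of the
# features. The default base estimator is a decision tree, as in [1].
subbag = BaggingClassifier(
    n_estimators=100,
    bootstrap=True,           # draw training samples with replacement
    max_samples=0.632,        # expected fraction of unique samples
    bootstrap_features=True,  # draw the feature subset with replacement
    max_features=0.75,        # random subspace of 75 % of the features
    random_state=0,
)
```

`BaggingRegressor` accepts the same parameters for regression tasks.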

To understand what scikit-learn does internally, one can look at the function `_generate_indices` [3]. For example, if we set `bootstrap=True` and `max_samples=0.632`, scikit-learn will call `random_state.randint` with `n_population=n` and `n_samples=int(0.632 * n)`.
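The sampling logic can be sketched in a few lines; this is a simplified reconstruction of the behaviour described above, not the actual scikit-learn source:

```python
import numpy as np

def generate_indices(random_state, bootstrap, n_population, n_samples):
    """Sketch of sklearn's index generation for one bagging member."""
    if bootstrap:
        # with replacement: n_samples draws from range(n_population)
        return random_state.randint(0, n_population, n_samples)
    # without replacement: a random permutation, truncated
    return random_state.permutation(n_population)[:n_samples]

rng = np.random.RandomState(0)
n = 506  # e.g. the size of the boston dataset
idx = generate_indices(rng, True, n, int(0.632 * n))
print(len(idx), np.unique(idx).size)
```

Because the draws are made with replacement, `np.unique(idx).size` is typically smaller than `len(idx)`.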

## Why max_samples=0.632?

By setting `bootstrap=True`, we draw samples with replacement. Hence, not all samples in a bootstrap sample will be unique. We now want to set `max_samples` to the average proportion of unique samples in a bootstrap sample.

According to [1], the advantage of setting this value is that possible outliers in the training set sometimes do not show up in the bootstrap sample.

### Derivation 1

Let us first derive the value 0.632 with some probability theory. We start by calculating the expected number of unique samples in $n$ draws with replacement from a set of $n$ samples. The probability that sample $i$ does not appear in any of the $n$ draws is $\left(1 - \frac{1}{n}\right)^n$, so the probability that it appears at least once is

$$P(\text{sample } i \text{ is drawn}) = 1 - \left(1 - \frac{1}{n}\right)^n.$$

Then the expected number of unique samples $U$ can be calculated by summing over all samples, i.e.

$$\mathbb{E}[U] = \sum_{i=1}^{n} P(\text{sample } i \text{ is drawn}) = n \left(1 - \left(1 - \frac{1}{n}\right)^n\right).$$

Next, we divide the number of unique samples by the total number of samples to obtain the ratio of unique samples:

$$\frac{\mathbb{E}[U]}{n} = 1 - \left(1 - \frac{1}{n}\right)^n.$$

And taking the limit yields

$$\lim_{n \to \infty} \left(1 - \left(1 - \frac{1}{n}\right)^n\right) = 1 - e^{-1} \approx 0.632.$$

### Derivation 2

Assume $X \sim \operatorname{Bin}\!\left(n, \frac{1}{n}\right)$, i.e. $X$ counts how often a particular sample appears in $n$ draws. Then by the Poisson limit theorem, with $\lambda = n \cdot \frac{1}{n} = 1$,

$$P(X = k) \to \frac{\lambda^k e^{-\lambda}}{k!} = \frac{e^{-1}}{k!}.$$

We have obtained the probability to get $k$ times sample $i$ in $n$ draws. Thus,

$$P(X \geq 1) = 1 - P(X = 0) = 1 - e^{-1} \approx 0.632,$$

and we get the same result as before.

### Derivation 3

Plotting the ratio of unique samples for increasing $n$ is another possibility to derive this number. And the plot shows us again $\approx 0.632$.
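The numbers behind such a plot can be estimated with a short simulation; this is a sketch, not the original plotting code:

```python
import numpy as np

# Estimate the expected ratio of unique samples in a bootstrap sample
# of size n drawn from n points, for growing n.
rng = np.random.RandomState(0)
for n in (10, 100, 1000, 10000):
    ratios = [np.unique(rng.randint(0, n, n)).size / n for _ in range(50)]
    print(n, round(float(np.mean(ratios)), 3))
# the ratios approach 1 - 1/e ≈ 0.632 as n grows
```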

## What happens if we set max_samples=0.632?

Let us plot the ratio of unique samples for various values of `max_samples`.

If we set `max_samples=0.632`, about 74 % of the samples in a bootstrap sample will be unique. The lower the value of `max_samples`, the higher the proportion of unique samples.
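This can again be checked numerically; the closed form $(1 - e^{-c})/c$ for the unique fraction at `max_samples = c` follows from the same limit argument as above and is my own addition:

```python
import numpy as np

# Fraction of unique points among the drawn ones for various values of
# max_samples (c), compared with the analytic approximation (1-e^-c)/c.
rng = np.random.RandomState(0)
n = 10000
for c in (0.25, 0.5, 0.632, 1.0):
    m = int(c * n)
    ratio = np.unique(rng.randint(0, n, m)).size / m
    print(c, round(ratio, 3), round((1 - np.exp(-c)) / c, 3))
```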

## Tests

I chose the following datasets for the tests (all included in scikit-learn; the tuples are (n_samples, n_features)):

- load_boston (506, 13)
- load_diabetes (442, 10)
- load_iris (150, 4)
- load_digits (1797, 64)
- load_wine (178, 13)
- load_breast_cancer (569, 30)

Besides setting a `random_state`, I did not change any parameters. Training was done with a 5-fold cross-validation.
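The setup can be sketched as follows for the classification datasets; the exact scores in the table below may differ slightly depending on the scikit-learn version (note that `load_boston` has been removed from recent versions):

```python
from sklearn.datasets import load_iris, load_wine
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Default parameters except random_state, 5-fold cross-validation.
models = {
    "RF": RandomForestClassifier(random_state=0),
    "SubBag": BaggingClassifier(
        bootstrap=True, max_samples=0.632,
        bootstrap_features=True, max_features=0.75,
        random_state=0,
    ),
}
for name, loader in (("iris", load_iris), ("wine", load_wine)):
    X, y = loader(return_X_y=True)
    for label, model in models.items():
        score = cross_val_score(model, X, y, cv=5).mean()
        print(f"{name:5s} {label:7s} {score:.2f}")
```

The regression datasets work the same way with `BaggingRegressor` and `RandomForestRegressor`, scored by R².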

Dataset | RF | bootstrap, max_samples=0.632 | SubBag | bootstrap | bootstrap, bootstrap_features |
---|---|---|---|---|---|
boston | 0.61 | 0.65 | 0.57 | 0.62 | 0.60 |
diabetes | 0.37 | 0.39 | 0.36 | 0.37 | 0.40 |
iris | 0.95 | 0.95 | 0.94 | 0.95 | 0.95 |
digits | 0.90 | 0.88 | 0.89 | 0.88 | 0.90 |
wine | 0.97 | 0.93 | 0.95 | 0.93 | 0.97 |
breast_cancer | 0.95 | 0.95 | 0.95 | 0.93 | 0.95 |

According to the table, the best models are:

- `bootstrap=True`, `bootstrap_features=True`, `max_samples=1.0`, `max_features=1.0`
- Random forest
- `bootstrap=True`, `max_samples=0.632`, `bootstrap_features=False`
Maybe SubBag performed so badly because we set `max_features=0.75`. Furthermore, I think the datasets were too small. With greater $n$, `max_samples=0.632` would have a different effect.

If we look at the individual folds instead of the averaged scores, the performances appear more similar.

## Discussion

The difference between RF and `bootstrap_features=True` is that RF draws subsets of features locally, at each node of a tree, whereas `bootstrap_features=True` selects the features globally, once prior to the construction of each tree.

In the tests shown here RF was outperformed by global selection of features. Other papers like [2] show similar results. According to [2], picking features globally can be better for larger datasets.

In general, I think it’s a good idea to set `bootstrap=True` and `bootstrap_features=True`. The best values for `max_samples` and `max_features` depend on the dataset. `max_samples=0.632` will not always achieve good results. It is perhaps better to keep the value at $1.0$.

## References

[1] P. Panov and S. Džeroski, “Combining Bagging and Random Subspaces to Create Better Ensembles,” Advances in Intelligent Data Analysis VII, pp. 118–129, 2007.

[2] G. Louppe and P. Geurts, “Ensembles on Random Patches,” Lecture Notes in Computer Science, pp. 346–361, 2012.

[3] https://github.com/scikit-learn/scikit-learn/blob/a24c8b46/sklearn/ensemble/bagging.py#L40
