Stacking Best Practices

Stacking is a popular ensemble method in data competitions, but general guidelines are nowhere to be found. Most articles just describe how Stacking works.

Based on solutions of data competitions and papers on the topic of Stacking, I tried to figure out whether there are any best practices. In this post, I will present some of my findings.

Stacking in general

Stacking can be described as follows:

  1. Randomly generate $k$ subsets of the training set, all of equal size.
  2. Omit one subset and use the remaining $k - 1$ subsets to train the level 0 models. Predict on the whole test set and on the omitted subset (out-of-fold).
  3. Repeat $k$ times, then average the results of the predictions on the whole test set.
  4. Repeat steps 1–3 for each of the $m$ level 0 models, then train the level 1 model on the out-of-fold (OOF) data and predict on the test set.

The process of Stacking can be described more easily with a minimal example:

import numpy as np
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, train_test_split

def stacking(X_train, y_train, X_test, models, meta_model, n_folds):
    n_models = len(models)

    # OOF predictions (level 1 training data) and test set predictions
    train = np.zeros((X_train.shape[0], n_models))
    test = np.zeros((X_test.shape[0], n_models))

    folds = KFold(n_splits=n_folds)

    for m, model in enumerate(models):
        test_m = np.zeros((X_test.shape[0], n_folds))
        for i, (train_index, test_index) in enumerate(folds.split(X_train, y_train)):
            model.fit(X_train[train_index], y_train[train_index])
            train[test_index, m] = model.predict(X_train[test_index])
            test_m[:, i] = model.predict(X_test)

        # average the per-fold test set predictions of model m
        test[:, m] = test_m.mean(axis=1)

    meta_model.fit(train, y_train)
    return meta_model.predict(test)

boston = load_boston()
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2)

models = [Ridge(), RandomForestRegressor()]
meta_model = LinearRegression()
y_hat = stacking(X_train, y_train, X_test, models, meta_model, 5)

However, on real data you wouldn’t use such a function. The problem is that every algorithm has its own needs (preprocessing, feature selection etc.). Even though it is possible to use scikit-learn’s pipeline.Pipeline, in general you don’t want to refit every model: if you change the parameters of only one model, every other model has to be regenerated as well.

Furthermore, one should always set random_state to some number so that it is possible to reproduce the results. For classification problems consider using StratifiedKFold. Shuffling is also useful for some algorithms.
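A minimal sketch of these points, using a tiny synthetic dataset: fixing random_state makes the (shuffled) folds reproducible, and StratifiedKFold keeps the class ratio equal across folds.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# same random_state => same shuffled folds on every run
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# for classification: each fold preserves the class proportions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for train_idx, test_idx in skf.split(X, y):
    # each test fold of size 2 contains exactly one sample of each class
    assert y[test_idx].sum() == 1
```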

Stacking on Kaggle

As an example, here is what the 2nd place solution of a Kaggle competition looks like [1]:

  • script 1 (level 0): train 16 LightGBM models with Stacking using 5 folds; save the OOF predictions to one csv file and the test set predictions to another csv file
  • script 2 (level 0): train 5 NN with Stacking using 5 folds; again save the OOF/test set predictions
  • script 3 (level 1): combine the outputs of scripts 1 and 2 by averaging (i.e. the OOF predictions were not needed)
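The level 1 step of such a pipeline can be sketched as follows. The file names and columns are hypothetical stand-ins for the csv files that scripts 1 and 2 would write; here they are created inline so the sketch runs on its own.

```python
import numpy as np
import pandas as pd

# stand-ins for the test set predictions saved by scripts 1 and 2
pd.DataFrame({"id": [0, 1, 2], "pred": [0.2, 0.5, 0.9]}).to_csv("lgb_test.csv", index=False)
pd.DataFrame({"id": [0, 1, 2], "pred": [0.4, 0.5, 0.7]}).to_csv("nn_test.csv", index=False)

lgb = pd.read_csv("lgb_test.csv")
nn = pd.read_csv("nn_test.csv")

# level 1 here is simple averaging, so no meta model is fit
submission = pd.DataFrame({"id": lgb["id"], "pred": (lgb["pred"] + nn["pred"]) / 2})
```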

Note: if you look at the code, the author does not use standard Stacking. Instead of building an $n \times m$ matrix (where $m$ is the number of models), he creates a single vector (i.e. a numpy array). After each iteration the predictions are summed together, and the final result is obtained by dividing by the number of models. Furthermore, the 16 LightGBM and the 5 NN are each copies of the same model: he only set different random seeds to create two more robust level 0 models.
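The running-sum variant is equivalent to column-wise averaging but needs only constant memory in the number of models. A minimal sketch with random stand-in predictions:

```python
import numpy as np

n_test, n_models = 4, 3
rng = np.random.default_rng(0)
preds = rng.random((n_models, n_test))  # stand-in for level 0 test predictions

# keep one vector instead of an n x m matrix
total = np.zeros(n_test)
for p in preds:            # one iteration per trained model / seed
    total += p
final = total / n_models

# identical to averaging the columns of the full matrix
assert np.allclose(final, preds.mean(axis=0))
```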

Let us look at level 1 models next. Meta models are in general quite simple. Even though it is possible to use more complicated models, many teams decide to use either Generalized Linear Models or simple averaging in the last level.

Here are some examples (combination of best $n$ submissions / level 1):

  • Allstate Claims Severity:
  • Porto Seguro’s Safe Driver Prediction:
  • Mercedes-Benz Greener Manufacturing:
  • Quora Question Pairs:
  • Application Screening

The teams employed quite a few other techniques to enhance the Stacking method. But it is not possible to know whether these techniques also work for other competitions. Stacking always depends on the data at hand. If there was a method that always works, then all teams would use it.

Other variations of Stacking (e.g. Blending) can be found here. However, I have never seen any of those variations in research papers.
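To make the contrast concrete: Blending replaces the k-fold OOF step with a single holdout set. The base models are fit once, and the meta model is trained on their holdout predictions. The sketch below uses synthetic regression data; it is an illustration of the idea, not a recipe.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ np.array([1.0, 2.0, 0.5, -1.0, 0.0]) + rng.normal(0, 0.1, 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# single holdout split instead of k folds
X_fit, X_hold, y_fit, y_hold = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

models = [Ridge(), LinearRegression()]
hold_preds = np.column_stack([m.fit(X_fit, y_fit).predict(X_hold) for m in models])
test_preds = np.column_stack([m.predict(X_test) for m in models])

# meta model is trained on holdout predictions only
meta = LinearRegression().fit(hold_preds, y_hold)
y_hat = meta.predict(test_preds)
```

The price of this simplicity is that the meta model sees fewer training examples than with OOF predictions.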

Research on Stacking

Most people use their own heuristics to determine the structure of the ensemble. However, intuition or rules of thumb are often not as good as methods that are guaranteed to be optimal.

This is why I list here some results that can be found in papers:

  1. The base learners must be diverse (i.e. make errors on different training examples) [2].
  2. The weights of the base learners can be constrained to be a convex combination without loss of performance (non-negative, sum to one) [2].
  3. Stacking performs asymptotically as well as the best possible weighted combination of the base learners [3].
  4. Diversity measures alone are ineffective in consistently achieving ensembles with good generalization [4].
  5. The base learners must be accurate [5].
  6. For regression tasks the following diversity measures are useful [6]:
    • correlation coefficient
    • covariance
    • dissimilarity measure
    • chi-square
    • mutual information
  7. For classification tasks the following diversity measures are useful [4]:
    • disagreement measure
    • double-fault measure
    • Kohavi-Wolpert variance
    • measurement of inter-rater agreement
    • generalized diversity
    • measure of “difficulty”

I included results on diversity measures, because there are some ideas on Kaggle to use diversity exclusively [11].

Besides these papers that analyze how to build more effective ensembles, there seems to be a general trend towards some kind of Automated Machine Learning (AutoML). This means choosing an optimization algorithm to find the best structure for the ensemble and sometimes even automating the whole pipeline. Examples include:

  • GA-Stacking (genetic algorithm) [7]
  • Ant Colony Optimization [8]
  • Artificial Bee Colony [9]


In 1992 Wolpert famously described Stacking as “black art” [10]. He wrote:

[…] there are currently no hard and fast rules saying what level 0 generalizers one should use, what level 1 generalizer one should use, what k numbers to use to form the level 1 input space, etc.

And this is still true today. In a study from 2015 researchers compared many different approaches to Stacking and found that there is no consensus [5].

Further reading

To get an overview on the topic of Stacking I can recommend [5]. There is also an interesting book called “Ensemble Methods: Foundations and Algorithms”. It covers everything from voting to ensemble pruning (ensemble selection). An implementation of some diversity measures can be found in the library brew.



[2] L. Breiman, “Stacked regressions,” Machine Learning, vol. 24, no. 1, pp. 49–64, Jul. 1996.

[3] M. J. van der Laan, E. C. Polley, and A. E. Hubbard, “Super Learner,” Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, Jan. 2007.

[4] E. K. Tang, P. N. Suganthan, and X. Yao, “An analysis of diversity measures,” Machine Learning, vol. 65, no. 1, pp. 247–271, Jul. 2006.

[5] M. P. Sesmero, A. I. Ledezma, and A. Sanchis, “Generating ensembles of heterogeneous classifiers using Stacked Generalization,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 5, no. 1, pp. 21–34, Jan. 2015.

[6] H. Dutta, “Measuring Diversity in Regression Ensembles”, 2009.

[7] A. Ledezma, R. Aler, A. Sanchis, and D. Borrajo, “GA-stacking: Evolutionary stacked generalization”, 2010.

[8] Y. Chen, M.-L. Wong, and H. Li, “Applying Ant Colony Optimization to configuring stacking ensembles for data mining,” Expert Systems with Applications, vol. 41, no. 6, pp. 2688–2702, May 2014.

[9] P. Shunmugapriya and S. Kanmani, “Optimization of stacking ensemble configurations through Artificial Bee Colony algorithm,” Swarm and Evolutionary Computation, vol. 12, pp. 24–32, Oct. 2013.

[10] D. H. Wolpert, “Stacked generalization,” Neural Networks, vol. 5, no. 2, pp. 241–259, Jan. 1992.