How to find and combine ML algorithms to improve your score

In this post, I will first explain how to find the best ML algorithms for a data set using some simple math. Then I will introduce a way to combine multiple algorithms.

Finding the right algorithms

Let’s assume you have already prepared your data. What most people would do now is putting all the available algorithms in a list, using cross validation and then choosing the best algorithm based on some loss function. However, this method will not always give you the best results.

Some algorithms make a number of assumptions that need to be considered before evaluating the loss function. For example k-NN regression and gradient boosting can deal without any problems with multicollinearity. Features like “sum of n independent variables” can even improve your score. But this is not the case for algorithms based on linear regression. You will see your score significantly lowered if you include features that are perfectly correlated to each other.

In other cases you include a feature that works great for linear models, but really badly for gradient boosting. For example some features could have cubic growth and you include terms $x_i^2$, $x_i^3$. Your polynomial regression will have a lower loss, but your boosting algorithm will just become much slower. The bottom line is what works for some algorithms might not work for others.

To deal with this problem, you have to consider the features also as parameters of the individual algorithm. To make this more concrete, let’s consider the following code:

This function reads a csv file, adds some features and returns the data. The feature engineering part is here not really important, the main thing is you can call this function as follows:

And this is simply counting in binary. What we can do now is generating the cartesian product of $n$ parameters and trying out all the possibilities. This means here $\{0, 1\} \times \cdots \times \{0, 1\} = \{0, 1\}^n$ with $n = 4$. We have to try therefore $2^4 = 16$ possibilities.

We are of course not limited to only using zeros and ones. If we had a feature like “polynomial regression”, we could just add $M = \{1, 2, 3, \dots\}$ to the cartesian product where $M$ is the degree of the polynomial.

In code this means:

The process is then:

1. call load_csv with $\{0, 0, 0, 0\}$
2. do cross validation on all the algorithms
3. call load_csv with with $\{0, 0, 0, 1\}$
4. and so on

Finally, after trying out all the possibilities we have found the best ML algorithms for the task at hand. Because the algorithm depends on the selected features, we have to create a new class as follows:

Then the algorithms can be called like this:

Combining the algorithms

First, we have to decide how many ML algorithms we want to combine. Let’s say we want to choose $k$ algorithms from a total of $n$ algorithms. This means we need to consider $\left(\!\!{\binom {n}{k}}\!\!\right)={\binom {n+k-1}{k}}$ possible combinations.

We are using here combinations with replacement, because it is possible that $k$ times the same algorithm produces better results than $k$ different algorithms.

More concretely, we can create a set $S = \{A_1, \dots, A_n\}$ that contains $n$ algorithms. Then we draw $1 \leq k \leq n$ algorithms from that set, combine the results via a meta model and calculate the loss. In the end, we choose the combination with the least loss.

The first part looks like this:

Here we are considering combinations of $k = 2$, $k = 3$ and $k = 4$. The function combinations_with_replacement gives tuples of the form $(1, 2)$, $(1, 3, 4)$, $(1, 1, 5, 3)$. We can use these numbers as indices to our set $S$ i. e. $(A_1, A_2)$, $(A_1, A_3, A_4)$, $(A_1, A_1, A_5, A_3)$.

We could now combine the algorithms (tuples) we have found via blending or stacking. However, these methods are not as clean and require a bit more code. This is why we will do something simpler in this post. If you are focused on winning a competition, you should probably stick to blending/stacking because these methods improve your score even more. In case you don’t know yet how stacking works, take a look here for some code: [1].

Let $S$ be still the set that contains the algorithms. Then

The result is an $m \times \vert S\vert$ matrix that contains the results of the algorithms $A_1, \dots, A_n$. We can select specific algorithms like this (where [1, 2], [1, 3, 4] are the tuples from before):

The next step is to find the meta model. Let $\mathbf{\hat{Y}_i}$ be an $m \times \vert S_i\vert$ matrix with $S_i \subseteq S$. We have to find a vector $\boldsymbol{\beta_i}$ such that $\mathbf{\hat{Y}_i}\boldsymbol{\beta_i} = \mathbf{y}$.

In theory, we can just use the squared $L^2$ norm as loss $\vert\vert\mathbf{y} - \mathbf{\hat{Y}_i}\boldsymbol{\beta_i}\vert\vert_2^2 = (y - \mathbf{\hat{Y}_i}\boldsymbol{\beta_i})(y - \mathbf{\hat{Y}_i}\boldsymbol{\beta_i})^T$. Followed by finding the minimum via the first matrix derivative, we get $\boldsymbol{\hat{\beta}_i} = \mathbf{\hat{Y}_i}^+\mathbf{y}$. But this would just be linear regression.

Let us try something different and add some constraints to $\boldsymbol{\beta_i}$. Assume $\boldsymbol{\beta_i} \in \{0.0, 0.1, \dots, 0.9, 1.0\}^{\vert S_i\vert}$ and $\sum_{j=1}^{\vert S_i\vert} \beta_{ij} = 1$. We can now use either some kind of mathematical optimization or try out all the possibilities to find $\boldsymbol{\hat{\beta}_i}$. For the sake of simplicity, I chose the latter method.

In the first section, we used the cartesian product $\{0, 1\}^n$. This time our base is much higher, so we have to be careful about the number of possibilities. At least $10^6$ possibilities should still be feasible on a modern computer. If we are sure that the contributions of the individual algorithms are never higher than $0.5$, then we can set $\boldsymbol{\beta_i}$ to a maximum of $0.5$ instead of $1.0$.

Let’s look at some code:

In order to get a more accurate result, I set $\boldsymbol{\beta_i} \in \{0.0, 0.01, 0.02, \dots, 0.49\}^{\vert S_i\vert}$.

So we can see that our meta model calculates just the weighted average of $\vert S_i\vert$ algorithms.

References

Categories:

Updated: