How to find and combine ML algorithms to improve your score

In this post, I will first explain how to find the best ML algorithms for a data set using some simple math. Then I will introduce a way to combine multiple algorithms.

Finding the right algorithms

Let’s assume you have already prepared your data. What most people would do now is put all the available algorithms in a list, run cross-validation and then choose the best algorithm based on some loss function. However, this method will not always give you the best results.

Some algorithms make a number of assumptions that need to be considered before evaluating the loss function. For example, k-NN regression and gradient boosting can handle multicollinearity without any problems. Features like “sum of n independent variables” can even improve your score. But this is not the case for algorithms based on linear regression: your score will suffer significantly if you include features that are perfectly correlated with each other.

In other cases you include a feature that works great for linear models, but really badly for gradient boosting. For example, some features could have cubic growth, so you include the terms x^2 and x^3. Your polynomial regression will have a lower loss, but your boosting algorithm will just become much slower. The bottom line is that what works for some algorithms might not work for others.

To deal with this problem, you have to consider the features also as parameters of the individual algorithm. To make this more concrete, let’s consider the following code:

def load_csv(csv_file, add_time, remove_capacity, add_feature1, add_feature2):
    df = pd.read_csv(csv_file, parse_dates=["timestamp"])

    if add_time:
        h = df["timestamp"].dt.hour + df["timestamp"].dt.minute / 60.
        df["hour_x"] = np.cos(2. * np.pi * (h / 24.))
        df["hour_y"] = np.sin(2. * np.pi * (h / 24.))

        m = df["timestamp"].dt.month + df["timestamp"].dt.dayofweek / 7
        df["month_x"] = np.cos(2. * np.pi * (m / 30.))
        df["month_y"] = np.sin(2. * np.pi * (m / 30.))

    df.drop("timestamp", axis=1, inplace=True)

    if remove_capacity:
        df.drop("capacity", axis=1, inplace=True)

    columns = ["speed10", "speed20", "speed30"]

    if add_feature1:
        df["speed_sum"] = df[columns].sum(axis=1)
        df["speed_mean"] = df[columns].mean(axis=1)
        df["speed_median"] = df[columns].median(axis=1)
        df["speed_var"] = df[columns].var(axis=1)
        df["speed_std"] = df[columns].std(axis=1)

    if add_feature2:
        df["speed_shifted_5"] = df[columns].sum(axis=1).shift(5)

    df.dropna(inplace=True)

    return df

This function reads a csv file, adds some features and returns the data. The feature engineering part is not really important here; the main thing is that you can call this function as follows:

load_csv(csv_file, 0, 0, 0, 0)
load_csv(csv_file, 0, 0, 0, 1)
load_csv(csv_file, 0, 0, 1, 0)
...

This is simply counting in binary. What we can do now is generate the cartesian product of the parameters and try out all the possibilities. Here this means the product {0, 1}^4, so we have to try 2^4 = 16 possibilities.

We are of course not limited to only using zeros and ones. If we had a feature like “polynomial regression”, we could just add a set such as {1, 2, 3} to the cartesian product, where each number is the degree of the polynomial.

In code this means:

from itertools import product

for i, combination in enumerate(product(*[[0, 1]] * 4)):
    print("Combination {}/{}: {}".format(i + 1, 2 ** 4, combination))
    df = load_csv("data.csv", *combination)
    ...
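
As a sketch of that idea: if load_csv hypothetically had a fifth parameter for the polynomial degree, each parameter would simply contribute its own set of values to the product (the extra parameter is an assumption for illustration, not part of the function above):

from itertools import product

# Four binary flags plus a hypothetical polynomial degree in {1, 2, 3}
value_sets = [[0, 1]] * 4 + [[1, 2, 3]]
for combination in product(*value_sets):
    print(combination)  # e.g. (0, 1, 0, 0, 2)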

The process is then:

  1. call load_csv with (0, 0, 0, 0)
  2. do cross validation on all the algorithms
  3. call load_csv with (0, 0, 0, 1)
  4. and so on (a minimal sketch of this loop follows below)
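
Putting the pieces together, a minimal sketch of this search loop could look like the following (the candidate algorithms, the "target" column name and the scoring metric are placeholders, not the exact setup of this post):

from itertools import product

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge

algorithms = [KNeighborsRegressor(), Ridge()]  # placeholder candidates

results = []
for combination in product(*[[0, 1]] * 4):
    df = load_csv("data.csv", *combination)
    X, y = df.drop("target", axis=1), df["target"]  # hypothetical target column
    for model in algorithms:
        score = cross_val_score(model, X, y, cv=5,
                                scoring="neg_mean_absolute_error").mean()
        results.append((combination, model.__class__.__name__, score))

# The highest (least negative) score marks the best feature/algorithm pair
print(max(results, key=lambda r: r[2]))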

Finally, after trying out all the possibilities, we have found the best ML algorithms for the task at hand. Because each algorithm depends on its selected features, we have to create a new class as follows:

class Preprocessor(BaseEstimator, TransformerMixin):

    def __init__(self, add_time, remove_capacity, add_feature1, add_feature2):
        self.add_time = add_time
        self.remove_capacity = remove_capacity
        self.add_feature1 = add_feature1
        self.add_feature2 = add_feature2

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        df = X.copy()

        if self.add_time:
            h = df["timestamp"].dt.hour + df["timestamp"].dt.minute / 60.
            df["hour_x"] = np.cos(2. * np.pi * (h / 24.))
            df["hour_y"] = np.sin(2. * np.pi * (h / 24.))

            m = df["timestamp"].dt.month + df["timestamp"].dt.dayofweek / 7
            df["month_x"] = np.cos(2. * np.pi * (m / 30.))
            df["month_y"] = np.sin(2. * np.pi * (m / 30.))

        df.drop("timestamp", axis=1, inplace=True)

        if self.remove_capacity:
            df.drop("capacity", axis=1, inplace=True)

        columns = ["speed10", "speed20", "speed30"]

        if self.add_feature1:
            df["speed_sum"] = df[columns].sum(axis=1)
            df["speed_mean"] = df[columns].mean(axis=1)
            df["speed_median"] = df[columns].median(axis=1)
            df["speed_var"] = df[columns].var(axis=1)
            df["speed_std"] = df[columns].std(axis=1)

        if self.add_feature2:
            df["speed_shifted_5"] = df[columns].sum(axis=1).shift(5)

        df.dropna(inplace=True)

        return df

Then the algorithms can be called like this:

make_pipeline(Preprocessor(1, 0, 0, 0),
              LGBMRegressor(learning_rate=0.003, n_estimators=4000, objective="regression_l1"))
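
A nice side effect of this design is that the preprocessing flags are now ordinary estimator parameters, so they can also be searched with scikit-learn’s own tools. A minimal sketch, assuming X is the raw DataFrame (still containing the timestamp column and no missing values) and y is the target; the parameter grid is only illustrative:

from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from lightgbm import LGBMRegressor

pipeline = make_pipeline(Preprocessor(1, 0, 0, 0),
                         LGBMRegressor(objective="regression_l1"))

# The preprocessing flags can be tuned like any other hyperparameter
param_grid = {
    "preprocessor__add_time": [0, 1],
    "preprocessor__add_feature1": [0, 1],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="neg_mean_absolute_error")
search.fit(X, y)  # X must still contain the raw timestamp column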

Combining the algorithms

First, we have to decide how many ML algorithms we want to combine. Let’s say we want to choose k algorithms from a total of n algorithms. This means we need to consider C(n + k - 1, k) possible combinations.

We are using combinations with replacement here, because it is possible that k copies of the same algorithm produce better results than k different algorithms.
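
As a quick sanity check of that count, with for example n = 10 algorithms and k = 3 draws:

from math import comb

n, k = 10, 3  # example values, not the numbers used in this post
print(comb(n + k - 1, k))  # 220 possible multisets of size k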

More concretely, we can create a set S that contains n algorithms. Then we draw k algorithms from that set, combine the results via a meta model and calculate the loss. In the end, we choose the combination with the lowest loss.

The first part looks like this:

from itertools import combinations_with_replacement

for k in range(2, 5):
    for c in combinations_with_replacement(range(0, n), k):
        print(c)

Here we are considering combinations of k = 2, 3 and 4. The function combinations_with_replacement gives tuples of the form (0, 0), (0, 1), (0, 2) and so on. We can use these numbers as indices into our set S, i.e. S[0], S[1], S[2].
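
In practice, S is just a list of the pipelines from the first section, each pairing an algorithm with the feature flags that worked best for it. A minimal sketch with placeholder models and flags:

from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge
from lightgbm import LGBMRegressor

# Placeholder choices -- each entry pairs an algorithm with its preprocessing flags
S = [
    make_pipeline(Preprocessor(1, 0, 0, 0), LGBMRegressor(objective="regression_l1")),
    make_pipeline(Preprocessor(1, 0, 1, 0), KNeighborsRegressor(n_neighbors=10)),
    make_pipeline(Preprocessor(0, 1, 1, 1), Ridge(alpha=1.0)),
]
n = len(S)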

We could now combine the algorithms (tuples) we have found via blending or stacking. However, these methods are more involved and require a bit more code, which is why we will do something simpler in this post. If you are focused on winning a competition, you should probably stick to blending/stacking, because these methods improve your score even more. If you don’t know yet how stacking works, take a look at [1] for some code.

Let S still be the set that contains the n algorithms. Then:

from sklearn.model_selection import cross_val_predict

# One column of out-of-fold predictions per algorithm in S
y_hat = np.zeros((y.shape[0], len(S)))
for i, model in enumerate(S):
    y_hat[:, i] = cross_val_predict(model, X, y, cv=5)

The result is an m x n matrix (m samples, n algorithms) that contains the out-of-fold predictions of the n algorithms in S. We can select specific algorithms like this (where [1, 2] and [1, 3, 4] are the tuples from before):

y_hat[:, [1, 2]]
y_hat[:, [1, 3, 4]]
...

The next step is to find the meta model. Let H be an m x k matrix containing the columns of y_hat selected by one of the tuples. We have to find a weight vector w such that Hw ≈ y.

In theory, we can just use the squared norm as loss, L(w) = ||Hw - y||^2. Setting the first matrix derivative to zero gives w = (H^T H)^(-1) H^T y. But this would just be linear regression.
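
In code, this unconstrained solution is a single call to np.linalg.lstsq (H being one of the column selections from above, e.g. y_hat[:, [1, 3, 4]]):

# Unconstrained least-squares weights, i.e. plain linear regression without intercept
w, *_ = np.linalg.lstsq(H, y, rcond=None)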

Let us try something different and add some constraints to w. Assume w_i ≥ 0 and Σ w_i = 1. We can now either use some kind of mathematical optimization or try out all the possibilities to find w. For the sake of simplicity, I chose the latter method.

In the first section, we used the cartesian product {0, 1}^4. This time our base is much higher (one value per step of the weight grid), so we have to be careful about the number of possibilities. A few million possibilities (e.g. roughly 50^4 with four weights and a step size of 0.01) should still be feasible on a modern computer. If we are sure that the contribution of an individual algorithm is never higher than 0.5, then we can set the upper bound of each weight to 0.5 instead of 1.

Let’s look at some code:

# your_comb is one of the index tuples from above, e.g. [1, 3, 4]
n_models = len(your_comb)

# Weights from 0.0 to 0.5 inclusive, in steps of 0.01
search_space = np.around(np.arange(0.0, 0.51, 0.01), decimals=2)
best_combination = [None, np.inf]
for combination in product(*[search_space] * n_models):
    # Only keep weight vectors that sum to 1 (isclose avoids floating point issues)
    if not np.isclose(np.sum(combination), 1.0):
        continue

    combined = np.average(y_hat[:, your_comb], axis=1, weights=combination)
    error = your_loss_function(y, combined)  # e.g. mean_absolute_error

    if error < best_combination[1]:
        best_combination[0] = combination
        best_combination[1] = error

To get a more accurate result, the step size of search_space can be decreased.

So we can see that our meta model just calculates a weighted average of the k selected algorithms.
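
As a final step, here is a minimal sketch of how the chosen weights could be applied to new data (X_test is a hypothetical hold-out set prepared like X; your_comb and best_combination come from the snippets above):

# Fit the selected models on the full training data and average their
# test predictions with the weights found above
best_weights = best_combination[0]
test_preds = np.column_stack([S[i].fit(X, y).predict(X_test) for i in your_comb])
y_pred = np.average(test_preds, axis=1, weights=best_weights)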

References

[1] https://github.com/log0/vertebral/blob/master/stacked_generalization.py
