Geometry of hard-margin SVM and Optimization


In this blog post, I will give a geometric interpretation of hard-margin SVM and solve the corresponding optimization problem using quadratic programming. See also the paper [1] for some proofs.

Overview

Let $C$ and $D$ be nonempty, disjoint and linearly separable convex sets. Then, according to the hyperplane separation theorem, there exist a vector $a \neq 0$ and a scalar $b$ such that

$$\langle a, c \rangle \geq b \quad \forall c \in C \qquad \text{and} \qquad \langle a, d \rangle \leq b \quad \forall d \in D.$$

Our task is to find the separating hyperplane $\{x : \langle a, x \rangle = b\}$ such that the distance to each set is as large as possible. In other words, the hyperplane must lie in the middle between the two sets $C$ and $D$.

There are two ways to define the objective for the optimization problem and then use the output:

  • Method 1:
    • Optimization: Find the points $c^* \in C$ and $d^* \in D$ that are closest to each other.
    • Output: Calculate the normal vector $a$ and the bias (offset) $b$ in order to construct a hyperplane.
  • Method 2:
    • Optimization: Find the hyperplanes $\langle a, x \rangle = \alpha$ and $\langle a, x \rangle = \beta$ that are furthest to the right (i.e. set $C$) and left (i.e. set $D$), while maximizing the distance to each other.
    • Output: Calculate $b = \frac{\alpha + \beta}{2}$ in order to construct a new hyperplane. The optimization step gave us the normal vector $a$, one bias $\alpha$ for $C$ and one bias $\beta$ for $D$ (see the summary right after this list).
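
To summarize the two outputs in formulas (my own shorthand; $c^*, d^*$ are the closest points found by method 1, and $a$, $\alpha$, $\beta$ are the quantities returned by method 2):

$$\text{Method 1:}\quad a = d^* - c^*, \qquad b = \Big\langle a,\ \tfrac{1}{2}(c^* + d^*)\Big\rangle; \qquad\qquad \text{Method 2:}\quad b = \frac{\alpha + \beta}{2}.$$

In both cases the separating hyperplane is $\{x : \langle a, x \rangle = b\}$.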

The second approach is how the hard-margin primal SVM is normally taught (see Ng’s lecture notes on SVM). The two approaches differ in how they obtain the final separating hyperplane, but are otherwise equivalent (they are related as primal and dual problems).

To see the equivalence, note that due to the constraint “furthest to the right and left” (method 2) the hyperplanes are supporting hyperplanes. This means each hyperplane has at least one point on the boundary, i.e. there exist $c \in C$ with $\langle a, c \rangle = \alpha$ and $d \in D$ with $\langle a, d \rangle = \beta$, such that $\langle a, x \rangle \geq \alpha$ for all $x \in C$ and $\langle a, x \rangle \leq \beta$ for all $x \in D$.

Thus, by finding supporting hyperplanes we are actually looking for the closest point in each convex set $C$ and $D$.
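
For the interested reader, here is a brief sketch of the duality (my own addition, following the argument in [1]). Write the Lagrangian of method 2 with multipliers $\lambda_c, \mu_d \geq 0$:

$$\begin{aligned}
L(a, \alpha, \beta, \lambda, \mu) &= \tfrac{1}{2}\lVert a \rVert^2 - (\alpha - \beta) + \sum_{c \in C} \lambda_c \big(\alpha - \langle a, c \rangle\big) + \sum_{d \in D} \mu_d \big(\langle a, d \rangle - \beta\big), \\
\frac{\partial L}{\partial \alpha} = 0 &\;\Rightarrow\; \sum_{c} \lambda_c = 1, \qquad \frac{\partial L}{\partial \beta} = 0 \;\Rightarrow\; \sum_{d} \mu_d = 1, \\
\frac{\partial L}{\partial a} = 0 &\;\Rightarrow\; a = \sum_{c} \lambda_c\, c - \sum_{d} \mu_d\, d.
\end{aligned}$$

Substituting these conditions back into $L$ gives the dual problem $\max_{\lambda, \mu} -\tfrac{1}{2}\big\lVert \sum_c \lambda_c c - \sum_d \mu_d d \big\rVert^2$ over $\lambda, \mu \geq 0$ with $\sum_c \lambda_c = \sum_d \mu_d = 1$, which is exactly method 1: find the closest pair of points in the two convex hulls.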

An example will make this all a bit clearer. Let us plot two convex sets with Python.

import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy.spatial import ConvexHull

sns.set()

random = np.random.RandomState(seed=8)

def plot_hull(points):
    # Draw the edges of the convex hull and scatter the points themselves.
    hull = ConvexHull(points)

    for simplex in hull.simplices:
        plt.plot(points[simplex, 0], points[simplex, 1], 'k-')

    sns.regplot(x=points[:, 0], y=points[:, 1], fit_reg=False)

# Two sets of random integer points: C in [0, 5), D in [5, 10).
C = random.randint(low=0, high=5, size=(10, 2))
plot_hull(C)

D = random.randint(low=5, high=10, size=(10, 2))
plot_hull(D)

plt.ylim(ymin=-1, ymax=10)
plt.savefig("plot.svg", format="svg")
plt.show()

We can see in the following plot that the two sets are disjoint and nonempty. Hence, it is possible to find a separating hyperplane.

[Figure: convex hulls of the two sets $C$ and $D$]

Method 1 will give us the following blue separating hyperplane.

[Figure: the separating hyperplane (blue) found by method 1]

Method 2 returns two red supporting hyperplanes and a blue separating hyperplane.

[Figure: two supporting hyperplanes (red) and the separating hyperplane (blue) found by method 2]

Now, let us try out both methods.

Method 1

The sets $C$ and $D$ in the Python code can be written in matrix notation as follows

$$C = \begin{pmatrix} 3 & 0 & 1 & 2 & 3 & 4 \\ 4 & 4 & 1 & 0 & 0 & 1 \end{pmatrix}, \qquad D = \begin{pmatrix} 7 & 9 & 9 & 8 & 6 & 6 \\ 5 & 5 & 9 & 9 & 8 & 6 \end{pmatrix},$$

where each column is one point. Points in the interior of the hulls were removed, because they can be represented as a convex combination of the remaining points.

If we look at the plot, we can already see that $c^* = (3, 4)^T$ and $d^* = (6, 6)^T$ are the closest two points. However, if we didn’t know the answer yet, we could find it by solving the following optimization problem:

$$\begin{aligned} \min_{u,\, v} \quad & \frac{1}{2} \lVert Cu - Dv \rVert^2 \\ \text{s.t.} \quad & \mathbf{1}^T u = 1, \quad \mathbf{1}^T v = 1, \\ & u \geq 0, \quad v \geq 0 \end{aligned}$$

In words: find $c = Cu$ and $d = Dv$ (i.e. points inside the convex hulls) such that the squared distance between the two points is minimized. The corresponding Python code looks like this:

from scipy.optimize import minimize
import numpy as np

# Hull vertices as columns; pad with zero blocks so that both matrices
# act on one common weight vector x = [u, v].
C = np.array([[3, 4], [0, 4], [1, 1], [2, 0], [3, 0], [4, 1]]).T
C = np.c_[C, np.zeros(C.shape)]

D = np.array([[7, 5], [9, 5], [9, 9], [8, 9], [6, 8], [6, 6]]).T
D = np.c_[np.zeros(D.shape), D]

x = np.zeros((C.shape[1], 1))

cons = ({'type': 'eq',
         'fun': lambda x: np.sum(x[:C.shape[1] // 2]) - 1},   # weights u sum to 1
        {'type': 'eq',
         'fun': lambda x: np.sum(x[C.shape[1] // 2:]) - 1},   # weights v sum to 1
        {'type': 'ineq',
         'fun': lambda x: x})                                 # all weights non-negative
res = minimize(lambda x: 0.5 * (C @ x - D @ x).T @ (C @ x - D @ x), x,
               method='SLSQP', constraints=cons)

print("c^* = {}".format(C @ res.x))
print("d^* = {}".format(D @ res.x))

In order to have only one vector to optimize, I padded both matrices with zeros. The program returns the same $c^* = (3, 4)^T$ and $d^* = (6, 6)^T$ we found before.
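
As a quick check (my own addition, reusing res and the zero-padded matrices C and D from the code above), the optimal weights must be valid convex-combination coefficients:

import numpy as np

u = res.x[:C.shape[1] // 2]        # convex-combination weights for the columns of C
v = res.x[C.shape[1] // 2:]        # convex-combination weights for the columns of D

print(np.all(res.x >= -1e-8))      # True: all weights are (numerically) non-negative
print(u.sum(), v.sum())            # both approximately 1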

The next step is to construct a normal vector between the points $c^*$ and $d^*$:

$$a = d^* - c^* = \begin{pmatrix} 6 \\ 6 \end{pmatrix} - \begin{pmatrix} 3 \\ 4 \end{pmatrix} = \begin{pmatrix} 3 \\ 2 \end{pmatrix}.$$

Then compute the offset by projecting the point halfway between the two closest points onto the normal:

$$b = \Big\langle a, \frac{c^* + d^*}{2} \Big\rangle = \Big\langle \begin{pmatrix} 3 \\ 2 \end{pmatrix}, \begin{pmatrix} 4.5 \\ 5 \end{pmatrix} \Big\rangle = 23.5 .$$

We have finally found our separating hyperplane. This means

$$\{x \in \mathbb{R}^2 : \langle a, x \rangle = b\} = \{x \in \mathbb{R}^2 : 3x_1 + 2x_2 = 23.5\}.$$

To make this more concrete, let us write the hyperplane in slope-intercept form:

$$x_2 = -1.5\,x_1 + 11.75 .$$
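
These values can also be computed directly from the optimizer’s output (a small sketch of my own, reusing res and the padded matrices from the method 1 code above):

import numpy as np

c_star = C @ res.x                       # approximately (3, 4)
d_star = D @ res.x                       # approximately (6, 6)

a = d_star - c_star                      # normal vector, approximately (3, 2)
b = a @ (c_star + d_star) / 2            # offset, approximately 23.5

print("slope = {:.2f}, intercept = {:.2f}".format(-a[0] / a[1], b / a[1]))
# slope = -1.50, intercept = 11.75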

The hyperplane can now be easily graphed.

# Continues the plotting session from above (matplotlib and numpy already imported).
plt.plot([3, 6], [4, 6], 'ro-', label=r'$d - c$')

x = np.linspace(-100, 100)
plt.plot(x, -1.5 * x + 11.75, 'b', label=r'$f(x) = -1.5x + 11.75$')
plt.legend()

And the graph is

[Figure: the vector $d^* - c^*$ (red) and the separating hyperplane $f(x) = -1.5x + 11.75$ (blue)]

To understand the red and the blue “line”: the red line gives us the slope (it is the normal direction), and any point on the blue line gives us the offset. All the points whose difference to the midpoint is orthogonal to the red line lie on the blue line.

Another way to see this: draw a position vector $x_0$ to any point on the blue line, then place the tail of the normal vector $a$ at the head of $x_0$. By considering all points orthogonal to $a$ (relative to $x_0$), we obtain again the blue line.
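
In set notation (my own summary of the two paragraphs above), with $m = \frac{1}{2}(c^* + d^*)$ denoting the midpoint of the two closest points, the blue line is

$$\{\, x \in \mathbb{R}^2 : \langle a,\, x - m \rangle = 0 \,\} = \{\, x \in \mathbb{R}^2 : \langle a, x \rangle = b \,\} = \{\, x \in \mathbb{R}^2 : x_2 = -1.5\,x_1 + 11.75 \,\}.$$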

Method 2

In the first method, we had to find the two closest points. This time we have to figure out which two points give us the optimal normal vector $a$ and then find the two offsets $\alpha$ and $\beta$.

To find $a$ we could choose the first point easily, but the second point is harder to find. We could even pick points outside the sets $C$ and $D$; an arbitrary choice, however, would not give us the maximal distance between the two hyperplanes.

Let us choose instead the two points $c^* = (3, 4)^T$ and $d^* = (6, 6)^T$. The first point will be located on the first supporting hyperplane. Next, calculate $a = c^* - d^*$ and $\alpha = \langle a, c^* \rangle$.

Thus, $a = (-3, -2)^T$ and $\alpha = -3 \cdot 3 - 2 \cdot 4 = -17$.

$d^* = (6, 6)^T$ is the leftmost point in the set $D$, therefore it will be on the second supporting hyperplane. Since the two hyperplanes are parallel, $a$ does not change. Then the second offset is $\beta = \langle a, d^* \rangle = -3 \cdot 6 - 2 \cdot 6 = -30$.

The separating hyperplane is given by $\langle a, x \rangle = \frac{\alpha + \beta}{2}$, i.e. $-3x_1 - 2x_2 = -23.5$. Then multiply by $-1$. We obtain the same result as in method 1.
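
A quick numerical check of this hand calculation (a standalone sketch of my own):

import numpy as np

c_star = np.array([3, 4])
d_star = np.array([6, 6])

a = c_star - d_star            # normal vector (-3, -2)
alpha = a @ c_star             # offset of the supporting hyperplane through c*: -17
beta = a @ d_star              # offset of the supporting hyperplane through d*: -30

b = (alpha + beta) / 2         # -23.5
print(-a, -b)                  # after multiplying by -1: [3 2] 23.5, as in method 1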

Let us plot the three hyperplanes.

[Figure: the two supporting hyperplanes and the separating hyperplane]

Now, we will set up an optimization problem:

$$\begin{aligned} \min_{a,\, \alpha,\, \beta} \quad & \frac{1}{2}\lVert a \rVert^2 - (\alpha - \beta) \\ \text{s.t.} \quad & \langle a, c \rangle \geq \alpha \quad \forall c \in C, \\ & \langle a, d \rangle \leq \beta \quad \forall d \in D \end{aligned}$$

In words, maximize the distance between two parallel supporting hyperplanes. Hence, the constraints are $\langle a, c \rangle \geq \alpha$ for all $c \in C$ and $\langle a, d \rangle \leq \beta$ for all $d \in D$.

Since the distance between the two hyperplanes is given by $\frac{\alpha - \beta}{\lVert a \rVert}$, maximizing this fraction is equivalent to minimizing $\lVert a \rVert$ and maximizing $\alpha - \beta$, which is what the objective $\frac{1}{2}\lVert a \rVert^2 - (\alpha - \beta)$ does.

from scipy.optimize import minimize
import numpy as np

C = np.array([[3, 4], [0, 4], [1, 1], [2, 0], [3, 0], [4, 1]])
D = np.array([[7, 5], [9, 5], [9, 9], [8, 9], [6, 8], [6, 6]])

# Decision variables: x = [a1, a2, alpha, beta]
x = np.ones(C.shape[1] + 2)

cons = ({'type': 'ineq',
         'fun': lambda x: C @ x[:2] - x[2]},   # <a, c> >= alpha for every point c in C
        {'type': 'ineq',
         'fun': lambda x: -D @ x[:2] + x[3]})  # <a, d> <= beta for every point d in D
res = minimize(lambda x: 1/2 * x[:2] @ x[:2] - (x[2] - x[3]), x,
               method='SLSQP', constraints=cons)

print("a^* = {}".format(res.x[:2]))
print("alpha^* = {}".format(res.x[2]))
print("beta^* = {}".format(res.x[3]))

The program returns

$$a^* \approx \begin{pmatrix} -3 \\ -2 \end{pmatrix}, \qquad \alpha^* \approx -17, \qquad \beta^* \approx -30,$$

which means the distance between the two supporting hyperplanes is

$$\frac{\alpha^* - \beta^*}{\lVert a^* \rVert} = \frac{13}{\sqrt{13}} = \sqrt{13} \approx 3.61,$$

exactly the distance $\lVert d^* - c^* \rVert$ between the closest points from method 1.

In the last section, we will solve this problem once more with a standard QP solver.

Standard hard-margin SVM

If you look on the internet (e.g. Wikipedia), you will see the hard-margin optimization problem often written like this (for training points $x_i$ with labels $y_i \in \{+1, -1\}$):

$$\begin{aligned} \min_{w,\, b} \quad & \frac{1}{2}\lVert w \rVert^2 \\ \text{s.t.} \quad & y_i \big(\langle w, x_i \rangle - b\big) \geq 1, \quad i = 1, \dots, n \end{aligned}$$

We can show that this optimization problem is equivalent to the one stated above. First, transform the problem into vector form, rename the variables, split the constraints by label and do minor rearrangements:

$$\begin{aligned} \min_{a,\, b} \quad & \frac{1}{2}\lVert a \rVert^2 \\ \text{s.t.} \quad & \langle a, x_i \rangle \geq b + 1 \quad \text{for all } i \text{ with } y_i = +1, \\ & \langle a, x_i \rangle \leq b - 1 \quad \text{for all } i \text{ with } y_i = -1 \end{aligned}$$

Then set $\alpha = b + 1$ and $\beta = b - 1$. Since $\alpha - \beta = 2$ is constant, minimizing $\frac{1}{2}\lVert a \rVert^2$ is the same as minimizing $\frac{1}{2}\lVert a \rVert^2 - (\alpha - \beta)$.

Finally, create a set $X = C \cup D$ and give each point $x_i \in X$ a label $y_i$: $+1$ if $x_i \in C$ and $-1$ if $x_i \in D$. The two optimization problems are now the same (the fixed value $\alpha - \beta = 2$ only changes the scaling of $a$, not the resulting hyperplanes).

Quadratic Programming (QP)

Let us now solve the standard hard-margin problem with QP. This time we will use cvxopt and a more complicated example.

First, we have to write the optimization problem in standard QP form

$$\begin{aligned} \min_{x} \quad & \frac{1}{2} x^T P x + q^T x \\ \text{s.t.} \quad & Gx \preceq h \end{aligned}$$

where $x = (a_1, a_2, b)^T$ is the weight vector (including the bias), $P = \operatorname{diag}(1, 1, 0)$, $q = \mathbf{0}$, $G = \begin{pmatrix} -C & \mathbf{1} \\ D & -\mathbf{1} \end{pmatrix}$ and $h = -\mathbf{1}$. Here $C$ and $D$ denote the matrices whose rows are the points of the two sets.

Then the code is:

import numpy as np
from cvxopt import matrix, solvers
from sklearn.datasets import make_blobs

# Quadratic term: penalize only the weights a1, a2, not the bias b.
P = matrix(np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0]]), tc='d')
q = matrix(np.zeros((3, 1)), tc='d')

# Points of class C (label +1): each row encodes -<a, c> + b <= -1, i.e. <a, c> - b >= 1.
C, _ = make_blobs(n_samples=100, centers=[(-5, -5), (0, 0)], n_features=2, random_state=0)
C = np.c_[-1 * C, np.ones((C.shape[0], 1))]

# Points of class D (label -1): each row encodes <a, d> - b <= -1.
D, _ = make_blobs(n_samples=100, centers=[(5, 5), (8, 8)], n_features=2, random_state=1)
D = np.c_[D, -1 * np.ones((D.shape[0], 1))]

G = matrix(np.vstack((C, D)), tc='d')
h = matrix(-1 * np.ones((C.shape[0] + D.shape[0], 1)), tc='d')

sol = solvers.qp(P, q, G, h)['x']
print("a^* = [[{}],[{}]]".format(sol[0], sol[1]))
print("b^* = {}".format(sol[2]))
print("Separating hyperplane: {} * x + {} * y = {}".format(sol[0], sol[1], sol[2]))
print("Supporting hyperplane 1: {} * x + {} * y = {} - 1".format(sol[0], sol[1], sol[2]))
print("Supporting hyperplane 2: {} * x + {} * y = {} + 1".format(sol[0], sol[1], sol[2]))

The program prints the solution, i.e. the weight vector $a^*$ and the bias $b^*$ of the separating hyperplane.

Remember we set $\alpha = b + 1$ and $\beta = b - 1$ before. Hence, the bias for the separating hyperplane is given by $\frac{\alpha + \beta}{2} = \frac{(b + 1) + (b - 1)}{2} = b$. In other words, this formulation of the optimization problem finds the separating hyperplane directly.
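
To draw the three hyperplanes, the solution can be converted to slope-intercept form just as in method 1 (a small sketch of my own, reusing sol from the code above):

import numpy as np
import matplotlib.pyplot as plt

a1, a2, b = sol[0], sol[1], sol[2]

x = np.linspace(-10, 12)
plt.plot(x, (b - a1 * x) / a2, 'b')        # separating hyperplane  <a, x> = b
plt.plot(x, (b - 1 - a1 * x) / a2, 'r')    # supporting hyperplane  <a, x> = b - 1
plt.plot(x, (b + 1 - a1 * x) / a2, 'r')    # supporting hyperplane  <a, x> = b + 1
plt.show()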

If you use method 1, you will find the same solution in terms of the closest points $c^*$ and $d^*$. The following plot shows the result of both method 1 and method 2 (with QP).

[Figure: the separating hyperplanes found by method 1 and method 2 (QP)]

References

[1] K. Bennett and E. Bredensteiner. Duality and Geometry in SVM Classifiers. In P. Langley (ed.), Proceedings of the Seventeenth International Conference on Machine Learning, pp. 57–64. Morgan Kaufmann, San Francisco, 2000.
