# Geometry of hard-margin SVM and Optimization

In this blog post, I will give a geometric interpretation of hard-margin SVM and solve the corresponding optimization problem using quadratic programming. See also the paper [1] for some proofs.

## Overview

Let $C$ and $D$ be nonempty, disjoint and linearly separable convex sets. Then, according to the hyperplane separation theorem, there exist $\mathbf{a} \neq \mathbf{0}$ and $b$ such that

$$\mathbf{a}^T\mathbf{x} \leq b \quad \text{for all } \mathbf{x} \in C \qquad \text{and} \qquad \mathbf{a}^T\mathbf{x} \geq b \quad \text{for all } \mathbf{x} \in D.$$

Our task is to find the separating hyperplane $\mathbf{a}^T\mathbf{x} = b$ such that the distance to each set is as large as possible. In other words, the hyperplane must be in the middle between the two sets $C$ and $D$.

There are two ways to define the objective for the optimization problem and then use the output:

- **Method 1:**
  - Optimization: Find the points $\mathbf{c} \in C$ and $\mathbf{d} \in D$ that are closest to each other.
  - Output: Construct the hyperplane from the normal vector $\mathbf{a} = \mathbf{d} - \mathbf{c}$ and the bias (offset) $b = \mathbf{a}^T\frac{\mathbf{c} + \mathbf{d}}{2}$.
- **Method 2:**
  - Optimization: Find the hyperplanes $\mathbf{a}^T\mathbf{x} = \alpha$ and $\mathbf{a}^T\mathbf{x} = \beta$ that are furthest to the right (i.e. supporting the set $C$) and to the left (i.e. supporting the set $D$), while maximizing the distance between them.
  - Output: Calculate $b = \frac{\alpha+\beta}{2}$ in order to construct the separating hyperplane. The optimization step already gave us the normal vector $\mathbf{a}$, one bias $\alpha$ for $C$ and one bias $\beta$ for $D$.

The second approach is how the hard-margin primal SVM is normally taught (see Ng’s lecture notes on SVM). The two approaches differ in how they obtain the final separating hyperplane, but are otherwise equivalent: they turn out to be primal and dual formulations of the same problem [1].

To see the equivalence, note that due to the constraint “furthest to the right and left” (method 2), the hyperplanes are supporting hyperplanes. This means each hyperplane touches its set in at least one boundary point, $\mathbf{c} \in \text{bd}(C)$ and $\mathbf{d} \in \text{bd}(D)$, such that for all $\mathbf{x} \in C$, $\mathbf{a}^T\mathbf{x} \leq \mathbf{a}^T\mathbf{c} = \alpha$ and for all $\mathbf{x} \in D$, $\mathbf{a}^T\mathbf{x} \geq \mathbf{a}^T\mathbf{d} = \beta$.

Thus, by finding supporting hyperplanes we are actually looking for the closest point in each convex set $C$ and $D$.

An example will make this all a bit clearer. Let us plot two convex sets with Python.
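
The original plotting code is not part of this excerpt. Below is a minimal sketch with matplotlib and scipy. The point sets are assumptions for illustration: the text only fixes $(3, 4)$ and $(1, 3)$ as members of $C$ and $(6, 6)$ as a member of $D$; all other coordinates are made up.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
from scipy.spatial import ConvexHull

# Assumed point sets; only (3, 4), (1, 3) in C and (6, 6) in D come from the text.
C = np.array([[0, 1], [2, 0], [3, 4], [0, 4], [1, 3]], dtype=float)
D = np.array([[6, 6], [8, 5], [7, 8], [9, 7]], dtype=float)

fig, ax = plt.subplots()
for points, color in [(C, "tab:orange"), (D, "tab:green")]:
    hull = ConvexHull(points)
    # Shade the convex hull and mark the individual points.
    ax.fill(points[hull.vertices, 0], points[hull.vertices, 1], color=color, alpha=0.4)
    ax.plot(points[:, 0], points[:, 1], "o", color=color)
fig.savefig("sets.png")
```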

We can see in the following plot that the two sets are disjoint and nonempty. Hence, it is possible to find a separating hyperplane.

Method 1 will give us the following blue separating hyperplane.

Method 2 returns two red supporting hyperplanes and a blue separating hyperplane.

Now, let us try out both methods.

## Method 1

The sets $C$ and $D$ from the Python code can be written in matrix notation, with one point per row.

Points like $(1, 3)$ were removed, because they can be represented as a convex combination of other points and therefore do not change the convex hull.
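
The claim that $(1, 3)$ is a convex combination of other points can be checked with a small feasibility LP. A sketch using `scipy.optimize.linprog`; the four points of $C$ below are assumptions, only $(3, 4)$ is fixed by the text:

```python
import numpy as np
from scipy.optimize import linprog

# Assumed remaining points of C; we test whether (1, 3) lies in their convex hull.
points = np.array([[0, 1], [2, 0], [3, 4], [0, 4]], dtype=float)
target = np.array([1.0, 3.0])

# Find weights lam >= 0 with sum(lam) = 1 and points^T lam = target.
A_eq = np.vstack([points.T, np.ones(len(points))])
b_eq = np.r_[target, 1.0]
res = linprog(c=np.zeros(len(points)), A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.status)  # status 0 means a valid convex combination exists
```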

If we look at the plot, we can already see that $\begin{bmatrix}3\\4\end{bmatrix} \in C$ and $\begin{bmatrix}6\\6\end{bmatrix} \in D$ are the two closest points. However, if we didn’t know the answer yet, we could find it by solving the following optimization problem:

$$\begin{aligned}
\underset{\mathbf{c},\,\mathbf{d}}{\text{minimize}} \quad & \lVert \mathbf{c} - \mathbf{d} \rVert_2^2 \\
\text{subject to} \quad & \mathbf{c} \in \text{Conv}(C), \; \mathbf{d} \in \text{Conv}(D)
\end{aligned}$$

In words: find $\mathbf{c} \in \text{Conv}\left(C\right)$ and $\mathbf{d} \in \text{Conv}\left(D\right)$ (i.e. inside the convex hulls) such that the squared $L_2$ distance between the two points is minimized. The corresponding Python code looks like this:
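
The original listing is missing from this excerpt. Here is a sketch that uses `scipy.optimize.minimize` rather than a dedicated QP solver, on assumed point sets (only the closest pair $(3, 4)$ and $(6, 6)$ is fixed by the text). A single weight vector `x = [lam, mu]` parametrizes both points, which is why zeros are padded into both matrices:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed point sets (one point per row); only (3, 4) and (6, 6) are fixed by the text.
C = np.array([[0, 1], [2, 0], [3, 4], [0, 4]], dtype=float)
D = np.array([[6, 6], [8, 5], [7, 8], [9, 7]], dtype=float)
nC, nD = len(C), len(D)

# Pad both matrices with zeros so one vector x = [lam, mu] yields
# c = C^T lam and d = D^T mu at the same time.
C_pad = np.hstack([C.T, np.zeros((2, nD))])
D_pad = np.hstack([np.zeros((2, nC)), D.T])

objective = lambda x: np.sum((C_pad @ x - D_pad @ x) ** 2)  # ||c - d||_2^2
constraints = [
    {"type": "eq", "fun": lambda x: np.sum(x[:nC]) - 1.0},  # lam on the simplex
    {"type": "eq", "fun": lambda x: np.sum(x[nC:]) - 1.0},  # mu on the simplex
]
x0 = np.r_[np.full(nC, 1 / nC), np.full(nD, 1 / nD)]
res = minimize(objective, x0, bounds=[(0, 1)] * (nC + nD), constraints=constraints)

c_star, d_star = C_pad @ res.x, D_pad @ res.x
print(np.round(c_star, 3), np.round(d_star, 3))  # close to [3, 4] and [6, 6]
```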

In order to have only one vector $\mathbf{x}$ to optimize, I added zeros to both matrices. The program returns the same $\mathbf{c}^*$ and $\mathbf{d}^*$ we found before.

The next step is to construct a normal vector $\mathbf{a}$ between the points $\mathbf{c}^* = \begin{bmatrix}3\\4\end{bmatrix}$ and $\mathbf{d}^* = \begin{bmatrix}6\\6\end{bmatrix}$:

$$\mathbf{a} = \mathbf{d}^* - \mathbf{c}^* = \begin{bmatrix}6\\6\end{bmatrix} - \begin{bmatrix}3\\4\end{bmatrix} = \begin{bmatrix}3\\2\end{bmatrix}$$

Then find the offset by projecting the point halfway between the two closest points onto the normal:

$$b = \mathbf{a}^T\,\frac{\mathbf{c}^* + \mathbf{d}^*}{2} = \begin{bmatrix}3 & 2\end{bmatrix}\begin{bmatrix}4.5\\5\end{bmatrix} = 23.5$$
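
These two steps can be written out in a few lines of numpy (a straightforward restatement of the formulas, not the post’s original code):

```python
import numpy as np

c_star = np.array([3.0, 4.0])  # closest point in C
d_star = np.array([6.0, 6.0])  # closest point in D

a = d_star - c_star               # normal vector: [3, 2]
midpoint = (c_star + d_star) / 2  # point halfway between the two sets
b = a @ midpoint                  # offset: 23.5

print(a, b)  # [3. 2.] 23.5
```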

We have finally found our separating hyperplane. This means

$$\mathbf{a}^T\mathbf{x} = b \quad \Longleftrightarrow \quad 3x_1 + 2x_2 = 23.5$$

To make this more concrete, let us write the hyperplane in slope-intercept form: $x_2 = -\frac{3}{2}x_1 + 11.75$.

The hyperplane can now be easily graphed.
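
A minimal sketch of the graphing step with matplotlib (assumed, since the original listing is not shown); the line is drawn via its slope-intercept form:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

a = np.array([3.0, 2.0])  # normal vector from method 1
b = 23.5                  # offset from method 1

slope = -a[0] / a[1]  # -1.5
intercept = b / a[1]  # 11.75

xs = np.linspace(0, 10, 100)
fig, ax = plt.subplots()
ax.plot(xs, slope * xs + intercept, "b-", label="separating hyperplane")
ax.plot([0, a[0]], [0, a[1]], "r-", label="normal vector")
ax.legend()
fig.savefig("hyperplane.png")
```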

And the resulting graph is shown below.

To understand the red and blue lines: the red line (the normal vector $\mathbf{a}$) fixes the orientation, and any point on the blue line gives us the offset. The blue line consists of all points whose displacement from a fixed point on that line is orthogonal to $\mathbf{a}$.

Another way to see this: draw a position vector $\overrightarrow{OB}$ (any point on the blue line), then place the tail of $\mathbf{a}$ at the head of the arrow $\overrightarrow{OB}$. By considering all points orthogonal to $\mathbf{a}$, we obtain again the blue line.

## Method 2

In the first method, we had to find the two closest points. This time we have to figure out which two points give us the optimal normal vector $\mathbf{a}$ and then find two offsets $\alpha$, $\beta$.

To find $\mathbf{a}$ we could choose the point $\begin{bmatrix}3\\4\end{bmatrix} \in C$ as a first point, but the second point is harder to find. We could even choose points outside the sets $C$ and $D$; however, most such pairs would not give us the maximal distance between the two hyperplanes.

Let us choose instead the two points $\begin{bmatrix}3\\4\end{bmatrix} \in C$ and $\begin{bmatrix}6\\6\end{bmatrix} \in D$. The first point will be located on the first supporting hyperplane. Next, calculate $\mathbf{a}$ as half the difference between the two points and read off the first offset:

$$\mathbf{a} = \frac{1}{2}\left(\begin{bmatrix}6\\6\end{bmatrix} - \begin{bmatrix}3\\4\end{bmatrix}\right) = \begin{bmatrix}1.5\\1\end{bmatrix}, \qquad \alpha = \mathbf{a}^T\begin{bmatrix}3\\4\end{bmatrix} = 8.5$$

Thus, $\mathbf{a} = \begin{bmatrix}1.5\\1\end{bmatrix}$ and $\alpha = 8.5$.

The point $\begin{bmatrix}6\\6\end{bmatrix}$ is the leftmost point in the set $D$, therefore it will be on the second supporting hyperplane. Since the two hyperplanes are parallel, $\mathbf{a}$ does not change. Then the second offset is $\beta = \mathbf{a}^T\begin{bmatrix}6\\6\end{bmatrix} = 15$.

The separating hyperplane is given by $\begin{bmatrix}1.5\\1\end{bmatrix}^T\mathbf{x} = \frac{\alpha + \beta}{2} = 11.75$. Multiplying both sides by $2$ gives the same result as in method 1.
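
The arithmetic of method 2 can be checked quickly in plain numpy (a restatement of the computation above):

```python
import numpy as np

c = np.array([3.0, 4.0])  # point on the first supporting hyperplane (set C)
d = np.array([6.0, 6.0])  # point on the second supporting hyperplane (set D)

a = (d - c) / 2         # normal vector: [1.5, 1]
alpha = a @ c           # 8.5
beta = a @ d            # 15.0
b = (alpha + beta) / 2  # 11.75

# Scaling by 2 recovers the method 1 hyperplane 3*x1 + 2*x2 = 23.5.
print(2 * a, 2 * b)  # [3. 2.] 23.5
```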

Let us plot the three hyperplanes.

Now, we will set up an optimization problem. Following [1], it can be written as

$$\begin{aligned}
\underset{\mathbf{a},\,\alpha,\,\beta}{\text{minimize}} \quad & \tfrac{1}{2}\lVert\mathbf{a}\rVert_2^2 - \alpha + \beta \\
\text{subject to} \quad & \mathbf{a}^T\mathbf{x} - \alpha \geq 0 \quad \forall \mathbf{x} \in C \\
& -\mathbf{a}^T\mathbf{x} + \beta \geq 0 \quad \forall \mathbf{x} \in D
\end{aligned}$$

In words: maximize the distance between two parallel supporting hyperplanes. Hence, the constraints are $\forall \mathbf{x} \in C,\ \mathbf{x}^T\mathbf{a} - \alpha \geq 0$ and $\forall \mathbf{x} \in D,\ -\mathbf{x}^T\mathbf{a} + \beta \geq 0$. (Note that the optimal $\mathbf{a}$ here points from $D$ towards $C$, so the inequalities are flipped compared to the overview, where $\mathbf{a}$ pointed from $C$ towards $D$.)

Since the distance between the two parallel hyperplanes is given by $\frac{\alpha - \beta}{\lVert\mathbf{a}\rVert_2}$, maximizing this fraction is equivalent to simultaneously minimizing $\beta - \alpha$ and $\lVert\mathbf{a}\rVert_2$.

In the last section, we will solve the problem using standard QP.
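
The program itself is missing from this excerpt. Here is a sketch that minimizes $\frac{1}{2}\lVert\mathbf{a}\rVert_2^2 - \alpha + \beta$ (one way to make “minimize $\beta - \alpha$ and $\lVert\mathbf{a}\rVert_2$” precise, following [1]) subject to the supporting-hyperplane constraints, using `scipy.optimize.minimize` rather than a dedicated QP solver and the assumed example point sets:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed point sets (one point per row); only (3, 4) and (6, 6) are fixed by the text.
C = np.array([[0, 1], [2, 0], [3, 4], [0, 4]], dtype=float)
D = np.array([[6, 6], [8, 5], [7, 8], [9, 7]], dtype=float)

# Variables z = [a1, a2, alpha, beta].
objective = lambda z: 0.5 * (z[0] ** 2 + z[1] ** 2) - z[2] + z[3]
constraints = [
    # For every x in C: a^T x - alpha >= 0.
    {"type": "ineq", "fun": lambda z: C @ z[:2] - z[2]},
    # For every x in D: -a^T x + beta >= 0.
    {"type": "ineq", "fun": lambda z: -(D @ z[:2]) + z[3]},
]
res = minimize(objective, x0=np.zeros(4), constraints=constraints)

a, alpha, beta = res.x[:2], res.x[2], res.x[3]
distance = (alpha - beta) / np.linalg.norm(a)
print(np.round(a, 3), round(alpha, 3), round(beta, 3), round(distance, 3))
```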

The program returns

$$\mathbf{a}^* = \begin{bmatrix}-3\\-2\end{bmatrix}, \qquad \alpha^* = -17, \qquad \beta^* = -30,$$

which means the distance is

$$\frac{\alpha^* - \beta^*}{\lVert\mathbf{a}^*\rVert_2} = \frac{13}{\sqrt{13}} = \sqrt{13} \approx 3.61.$$

### Standard hard-margin SVM

If you look on the internet (e.g. Wikipedia), you will see the hard-margin optimization problem often written like this:

$$\begin{aligned}
\underset{\mathbf{a},\,b}{\text{minimize}} \quad & \tfrac{1}{2}\lVert\mathbf{a}\rVert_2^2 \\
\text{subject to} \quad & y^{(i)}\left(\mathbf{a}^T\mathbf{x}^{(i)} - b\right) \geq 1, \quad i = 1, \dots, m
\end{aligned}$$

We can show that this optimization problem is equivalent to the one stated above. First, transform the problem into vector form, rename the variables and do minor rearrangements: the constraint for $y^{(i)} = 1$ becomes $\mathbf{a}^T\mathbf{x}^{(i)} \geq b + 1$, and the constraint for $y^{(i)} = -1$ becomes $\mathbf{a}^T\mathbf{x}^{(i)} \leq b - 1$.

Then set $\alpha = b + 1$ and $\beta = b - 1$. The constraints turn into the supporting-hyperplane constraints from before, and the difference $\alpha - \beta = 2$ is now fixed, so minimizing $\lVert\mathbf{a}\rVert_2$ maximizes the distance $\frac{\alpha - \beta}{\lVert\mathbf{a}\rVert_2} = \frac{2}{\lVert\mathbf{a}\rVert_2}$.

Create a set $A = C \cup D$ and give each $\mathbf{x}^{(i)} \in A$ a value $y^{(i)} \in \{-1, 1\}$. The two optimization problems are now the same.

Let us now solve the standard hard-margin problem with QP. This time we will use cvxopt and a more complicated example.

First, we have to write the optimization problem in standard form:

$$\begin{aligned}
\underset{\mathbf{x}}{\text{minimize}} \quad & \tfrac{1}{2}\mathbf{x}^T P\,\mathbf{x} + \mathbf{q}^T\mathbf{x} \\
\text{subject to} \quad & G\mathbf{x} \preceq \mathbf{h}
\end{aligned}$$

where $\mathbf{x} = \begin{bmatrix}a_1 & a_2 & b\end{bmatrix}^T$ is a $3\times 1$ weight vector, $P = \begin{bmatrix}1 & 0 & 0\\0 & 1 & 0\\0 & 0 & 0\end{bmatrix}$ (the bias $b$ is not penalized), $\mathbf{q} = \begin{bmatrix}0 \\ 0 \\ 0\end{bmatrix}$, the $i$-th row of $G$ is $-y^{(i)}\begin{bmatrix}x_1^{(i)} & x_2^{(i)} & -1\end{bmatrix}$ and $\mathbf{h} = \begin{bmatrix}-1 & \cdots & -1\end{bmatrix}^T$.

Then the code is:

The solution is $\mathbf{a}^* \approx \begin{bmatrix}-0.1978\\-0.2294\end{bmatrix}$ and $b^* \approx -1.2344$.

Remember that we set $\alpha = b + 1$ and $\beta = b - 1$ before. Hence, the bias of the separating hyperplane is given by $\frac{\alpha+\beta}{2} = \frac{(b + 1) + (b - 1)}{2} = b$, i.e. this formulation of the optimization problem directly finds the separating hyperplane.

If you use method 1, you will find the same solution in terms of points $\mathbf{d}^* = \begin{bmatrix}2.16323595\\ 1.33652795\end{bmatrix}$ and $\mathbf{c}^* = \begin{bmatrix}4.31965563\\3.83680381\end{bmatrix}$. The following plot shows both method 1 and method 2 (with QP).

## References

[1] K. Bennett and E. Bredensteiner. Duality and Geometry in SVM Classifiers. In *Proceedings of the Seventeenth International Conference on Machine Learning*, pp. 57–64, Morgan Kaufmann, San Francisco, 2000.
