# Support Vector Machines for Classification

In this post, I’ll discuss what a hyperplane is, what SVM does, and the basics of how to get the optimal hyperplane in both linear and non-linear data.

SVM is a supervised machine learning algorithm that finds the hyperplane that best separates data points into different classes. Although the algorithm can also be applied on regression (in which case it’s called Support Vector Regression, or SVR), this post will focus on its application on classification problems, where it is much easier to build intuition from when we’re first learning about the algorithm.

##### What is a hyperplane?

Simply put, a hyperplane is any line that divides any space into two distinct parts. Take a piece of paper, for instance, then draw a line (it doesn’t have to be straight) that divides the paper into two parts.

In a more formalized definition, a hyperplane is any n-1 dimensional subset of an n-dimensional space that divides that space into two distinct parts. For example, if n=2 (2-dimensional object), then a hyperplane is any n-1 dimensional object that divides the 2-D space into two parts.

Looking back to our paper example, the “space” is the piece of paper. The paper has two dimensions (length, width), thus n=2. Draw a line that divides the paper into two parts. The line is 1-dimensional (length). Thus, the line is a hyperplane of the paper.

If n=3 (3-dimensional object), then a hyperplane is any plane of two dimensions that divides that 3-D object into two distinct parts. And so it goes, for any value of n.

Now, in terms of machine learning, if we had a 2-dimensional Cartesian plane where the x-axis is Feature 1 and the y-axis is Feature 2 of a certain data set, then a hyperplane is any line that separates the data points into two distinct classes within the plane. Lines A and B are both hyperplanes of this 2-D plot. In fact, there’s an infinite number of hyperplanes in this plot. The question is – which one of them can best divide the data points into two classes?

Notice that what a hyperplane does is sort of what a classification algorithm does – creating a division between the data set. However, this is not intelligent enough to make an accurate distinction in real life. A hyperplane merely divides the data set into two, regardless of there not being an actual distinction between the two parts. This is where SVM comes in. The SVM algorithm will find the optimal hyperplane that distinguishes data from one class and data from another class.

#### Part 1: Linear Separation of Data

Let’s first take a look at a simple case wherein our data can be separated by a straight line.

###### How does SVM find the best hyperplane?

One might intuitively think that in order to differentiate one class from another, an algorithm must find the differences in the features between each class. Interestingly, SVM does the opposite – it looks at the similarities between classes. To see how, let’s take a look at classification between two big cats: a leopard and a jaguar.

Leopards and jaguars are cats that look so much alike. However, there are two features that can help us identify one from the other: the length of the tail and the body size.

Imagine we have data about the body size and tail length of several jaguars and leopards. We can plot the data to visualize where the classes lie: Jaguars cluster on the upper left corner of the plot because of their generally bigger body size and shorter tail. Leopards cluster on the lower right due to their slightly smaller stature and longer tail.

What SVM does is ignore the data points that are distinctly jaguars and leopards. Instead, it looks at the data points where jaguars start to look like leopards. These data points are encircled in the image below. The area where jaguars are closest to the leopards is where we can draw a hyperplane that can separate these two classes. I’ve drawn a representative line hyperplane A within this area. We can say that the encircled data points are the support vectors of hyperplane A. Two jaguars (encircled on the left side of line A) have almost the same features as the three leopards (encircled on the right side of Line A)

Support vectors are, therefore, the closest data points in distance from a representative hyperplane that separates two classes. By using support vectors to identify the dividing hyperplane, SVM effectively ignores the rest of the data set and focuses only on select data points to identify the best classification model.

The problem with hyperplane A is that it is too close to the leopard support vectors. This could mean that if, for example, we introduced another leopard somewhere close to the jaguar support vectors, then it is likely that SVM will incorrectly classify it as a jaguar since it has more leeway on the jaguar side than on the leopard side. We could say that hyperplane A is biased towards jaguars.

We can draw another hyperplane – hyperplane B – slightly moved to the center such that the distance of this hyperplane from jaguar support vectors is almost the same as its distance from leopard support vectors. We can say that hyperplane B is better than hyperplane A at separating the two classes since it has reduced the bias towards a particular class. The edges of the margin of hyperplane B are indicated by the green lines. The distance of the margin edge to the hyperplane is indicated by the purple lines.

The distance between a hyperplane and its closest support vector is called the margin. As a rule, the margins on each class partition of a hyperplane must be equidistant from that hyperplane. The illustration below compares the margins of hyperplanes A and B. Following the constraint that the margins must be equal on both classes, we can say that hyperplane B has maximized the margin so that there is a fair distance between each support vector of each class and the hyperplane.

Thus, the goal of SVM is to find the maximum margin that reduces the distance of each support vector from the dividing hyperplane.

#### Part 2: Non-Linear Separation of Data

What if we cannot use a straight line to separate classes due to the non-linear distribution of the data set, like the circular shape of the data in the 2-D plane below?

For non-linear data, SVM performs a method called a kernel function, or the kernel trick, which performs the separation of the classes in another dimension. For example, given the 2-dimensional plot above, we can add another dimension which we can call dimension z.

Each data point’s original x and y values remain the same, but the data points will have an additional value z as the new dimension which sets them apart. In Image A, we have added a new dimension z, but the pink and blue data points still do not look linearly separable. That’s because we haven’t set a value for z yet!

The kernel trick is this: If the value of z can be computed as z = x2 + y2, then the difference in values of each data point will be “exaggerated”, in the sense that those that are away from the origin (the center of the plane) will be very high and those close to the origin will be low in the z-dimension. Take a look at Image B – the data points’ x and y coordinates remain the same, but in their z-dimension, each takes the value x2 + y2, which effectively pulls their z-coordinate to a large positive value. Image B: “Raising” the data points at the outer edge of the circular data set through the equation x^2 + y^2 in the z-dimension. The graph of x^2 + y^2 looks like this. The data in the outer edges become more pronounced, while the data in the middle stay at the bottom of the curve.

Once we apply this equation on the z-axis, then we get a result similar to Image C. Note that I changed the perspective on the plane a bit to see where each class lies. Image C : x^2 + y^2 applied on the z-dimension, causing the blue data points far from the center to rise and the data points at the middle to remain at the bottom.

You’ll notice that the blue data points can now be separated from the pink data points with a straight line, like so:

Recall that the data points retain the original values of their 2 dimensions (x and y), and by using the kernel trick, we have simply transformed the data in the additional dimension z in order for SVM to find the optimal straight-line hyperplane.

There are other ways (i.e., other kernels) to transform non-linear data in a higher dimension, but we’ll settle on this for the basics. The important thing to remember is that SVM needs to use a hyperplane to separate data into classes, and the best way it can find the optimal one for a non-linear data set is to look in a higher dimension.