Ok, you read Part 1 and you understand how to draw a hyperplane. You're now officially a machine learning expert. The rest is superficial.
But let's continue building this support vector machine.
The first step is to find the parameters of the hyperplane (i.e. the parameters w and b) that correspond to the best hyperplane we could draw in order to properly separate the "+" and "-" observations.
How do we define the "best" separation? While writing this article I stumbled upon a blog post that explains the answer to this question so well that I couldn't do it better. So I will redirect you to this post, there is no point in me re-doing the work: http://www.svm-tutorial.com/2014/11/svm-understanding-math-part-1/
Quick summary: the best separating hyperplane is the one that will be right in the middle of the "gap" between the "+" and "-" support vectors.
By definition, the margin is the distance between that separating hyperplane and the closest data points on each side (i.e. the support vectors).
Therefore, we can define the SVM problem (i.e. finding the best separating hyperplane) as a constrained optimization problem:
1) Maximize the separating margin. The margin is the distance between the separating hyperplane and the closest data points on each side, and we want the hyperplane to sit as far as possible from those closest points, so mathematically this means maximizing that distance.
2) But because we don't want to maximize the margin past the closest data points, we need to specify a constraint to make it stop at the support vectors. In other words, we need to tell the maximization algorithm when to stop.
Mathematically, this means that we minimize the function:

minimize (1/2)||w||^2

Subject to the constraint:

y_i * (<w, x_i> + b) >= 1, for every data point x_i with label y_i
Let's first look at the objective function: we're essentially minimizing w (and for the moment let's ignore the less relevant stuff around it). If you remember the conclusions from part 1 of this series, minimizing w is the same as maximizing the separating margin.
Also, to be clear, because w is a vector, when I talk about minimizing w, what we're doing is trying to minimize the magnitude of the weight vector as a whole. The norm of w, written ||w||, is:

||w|| = sqrt(w1^2 + w2^2 + ... + wn^2)
Where the w's are the components of the w vector for each dimension. In other words, in a two-dimensional chart, w1 is the x-axis value, and w2 is the y-axis value.
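As a quick numeric check of that formula, here's a minimal sketch (assuming numpy; the w values are just illustrative):

```python
import numpy as np

w = np.array([3.0, 4.0])  # illustrative weight vector

# ||w|| = sqrt(w1^2 + w2^2 + ... + wn^2)
norm = np.sqrt(np.sum(w ** 2))
print(norm)  # 5.0
```

np.linalg.norm(w) computes the same quantity directly.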
The other things around the ||w|| are there as mathematical devices to simplify derivatives later on -- they don't contribute much to the intuitive understanding of what is going on.
As for the constraint, it's saying that every data point in the dataset should be correctly classified: f(x) should come out >= +1 for points in the "+" class, and <= -1 for points in the "-" class. Another way of putting it is that there should be no data points within the gap formed by the margin's edges.
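To make the formulation concrete, here's a minimal sketch that hands this exact objective and these constraints to a generic solver (assuming numpy and scipy; the toy dataset and the variable packing are my own illustration, not a production SVM):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two "+" points and two "-" points
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Decision variables packed as v = [w1, w2, b]
def objective(v):
    w = v[:2]
    return 0.5 * np.dot(w, w)  # (1/2)||w||^2

# One inequality constraint per data point: y_i * (<w, x_i> + b) - 1 >= 0
constraints = [
    {"type": "ineq",
     "fun": lambda v, xi=xi, yi=yi: yi * (np.dot(v[:2], xi) + v[2]) - 1.0}
    for xi, yi in zip(X, y)
]

result = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = result.x[:2], result.x[2]
print("w =", w, "b =", b)
```

For this toy dataset the solver lands on (approximately) w = (0.25, 0.25) and b = 0, which puts the support vectors (2, 2) and (-2, -2) exactly at f(x) = +1 and f(x) = -1, i.e. the hyperplane stops right at the closest points on each side.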
Alright, we've successfully formulated the optimization problem that will allow us to determine the ideal separating hyperplane for a given dataset, and hopefully we've understood it as well. Stay tuned for part 3, where we will begin explaining how to solve this optimization problem.
If you want a deeper look into the math of the optimization problem we just described, I highly recommend this blog post, by the same author of the previous blog post I linked to.
What follows in this series of articles is an explanation of Support Vector Machines that focuses on intuitive understanding, rather than purely mathematical derivation. The target audience is those who have a basic knowledge of machine learning vocabulary, but aren't necessarily Ph.Ds in math.
SVMs are binary classifiers (although there exists Support Vector Regression, as well as ways to classify multinomial data -- but we won't go into that, at least not yet) whose entire purpose is to draw a line (when the feature set is 2-dimensional; a hyperplane when it's higher-dimensional) that separates the "+" class observations from the "-" class observations (we'll call them "+" and "-", but technically we use +1 and -1 as the numerical labels).
The equation for a line or hyperplane, as typically seen in machine learning textbooks, is:

f(x) = <w, x> + b
This should look familiar if you remember high school math: a 2-d line is y = mx + b, so if we have more than 2 dimensions it's y = m1x1 + m2x2 + ... + b, which is more concisely expressed as y = <w, x> + b, the dot product between all the w's and all the x's, plus b.
But here lies a first source of confusion already. Often, next to that formula, in textbooks they'll show a 2D plot with a decision boundary line that separates the "+" observations from the "-" observations. Well, plot twist: that decision boundary that you see is NOT w, nor is it <w, x> + b.
w is actually a vector that is orthogonal to that decision boundary.
The decision boundary that you often see in those graphs is, rather, the set of all values of x that satisfy the following equation:

<w, x> + b = 0
As a reminder, also, keep in mind that the y axis in those graphs is NOT the y or f(x) of the f(x) = <w, x> + b formula. The y axis is in fact the second x variable, because the assumption behind those graphs is always that there are two features, x1 and x2.
So what is y = <w, x> + b? It's the decision plane: if y > 0, the observation is classified with a "+" label; if y < 0, with a "-" label. It's the function you will be using to perform classification. But to truly visualize this function, you need to turn the previous 2D graph into a 3D one, where the Z axis is the f(x) of the formula.
That black line across the center is the decision boundary, where the values of <w,x> + b give 0. Before that, the values of y are below zero, hence classified as "-", and beyond that line, the values are above zero, hence classified as "+".
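Classification with this function is a one-liner. A minimal sketch (assuming numpy; the w and b values are hand-picked for illustration, not learned from data):

```python
import numpy as np

# Illustrative, hand-picked parameters (not fitted to any dataset)
w = np.array([1.0, 1.0])
b = -1.0

def decision_value(x):
    """f(x) = <w, x> + b"""
    return np.dot(w, x) + b

def classify(x):
    """The sign of f(x) picks the class."""
    return "+" if decision_value(x) > 0 else "-"

print(classify(np.array([2.0, 2.0])))    # "+" since f = 3 > 0
print(classify(np.array([-1.0, -1.0])))  # "-" since f = -3 < 0
```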
Before we delve further into the roles of w and b in that picture, we will introduce the notion of support vectors and separating margin.
The Support Vectors are those observations in the picture that are closest to the decision boundary, on each side of it.
The separating margin is the distance from the decision boundary to these support vectors. As we estimate the parameters of the SVM, we will be choosing them so that the y value for each of these support vectors is +1 or -1 (depending on if they are "+" examples or "-" examples, respectively).
The role of b in the function y = <w,x> + b is to decide the positioning of the decision boundary. In other words, it decides where the value of y crosses 0. Increasing or decreasing b, then, pulls this boundary back into the "-" region or pushes it further into the "+" region.
The role of w is to control the "scaling", i.e. the width of the margin. In other words, it controls how big the step in the x variables needs to be in order to get to a y value +1 or -1. To understand what exactly I mean by that, it suffices to look at two examples: one where the w's are all equal to 1, and one where the w's are all equal to 2.
Let's have x1 = 1 and x2 = 1, for simplicity, and b = 0:
1) with w = (1, 1): f(x) = 1 * 1 + 1 * 1 = 2
2) with w = (2, 2): f(x) = 2 * 1 + 2 * 1 = 4
So in the first scenario, with w = (1, 1), those x values gave a classification score of +2, while in the second scenario, the same x values gave y = +4. This means that in the first scenario, the width of the margin (i.e. the distance from the decision boundary required to reach +1 and -1) was represented by the x values (0.5, 0.5), while in the second scenario the x values (0.25, 0.25) are enough to reach f(x) = +1. The thing to remember here is that the greater the values of w, the smaller the margin.
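The two scenarios are easy to replay numerically (a minimal sketch, assuming numpy):

```python
import numpy as np

x = np.array([1.0, 1.0])
b = 0.0

for w in (np.array([1.0, 1.0]), np.array([2.0, 2.0])):
    f = np.dot(w, x) + b
    # Since b = 0 and f is linear, shrinking x by a factor of f lands
    # exactly on f = +1, i.e. on the margin's edge
    edge = x / f
    print(f"w = {w}: f(x) = {f}, margin edge reached at x = {edge}")
```

Doubling w halves the x step needed to reach f(x) = +1: the margin edge moves from (0.5, 0.5) to (0.25, 0.25).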
Ok, that should cover the introductory theory. You're ready to delve into the core of the Support Vector Machines. Stay tuned for part 2.
Quantum Computing is probably the "next big thing" in computer science, and it will certainly bring AI and Machine Learning to entirely new levels of efficiency and capability. You could even argue that quantum computing (or, at the very least, a new form of fundamentally more powerful computation than the electronic switch) is a prerequisite to achieving the goal of Artificial General Intelligence.
With IBM offering access to their quantum cloud computing, I got curious and started looking into this technology more "seriously". I stumbled upon a great website that essentially simulates a vector of QuBits (the quantum world equivalent of a traditional bit) and quantum gate operations on them. It lets you play with them and get an intuitive understanding of how quantum computing works. That being said, the explanations on the website are quite minimalist, and it's not entirely easy to understand what is going on, or what you are looking at.
I will explain here what I figured out from spending about 10 minutes playing with the simulator, and it revolves around the Hadamard gate. Disclaimer: I have very little knowledge of the world of quantum computing, and what follows are essentially my baby steps into this world, trying to figure basic things out from scratch.
When you open the Hadamard Gate tutorial, you start with a 3D view that looks like this:
The first thing to understand is what this 3D image represents. It's a visual representation of the state of the array of QuBits in the simulated quantum register.
If we have 8 QuBits (which we do in this example), this means we have a total of 2^8 possible QuBit states, i.e. 256. This can be represented visually in a 16 by 16 matrix, which is what we see. The areas that are elevated indicate a possible state.
To understand what I mean by "possible state", we have to understand how a QuBit works. A single QuBit, rather than being 0 or 1, can be both at the same time. That is, it's possible to apply an operation on a QuBit (as it turns out, this operation is the Hadamard Gate, but we'll come back to that) in order to put it into 2 possible simultaneous states. In quantum lingo: a measurement of the superposed QuBit will have equal probabilities of collapsing to 1 or 0 (for more theoretical background, look into the notions of quantum measurement and quantum wave function collapse).
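Numerically, a QuBit's state is just a 2-vector of amplitudes, and the Hadamard Gate is a 2x2 matrix. A minimal sketch (assuming numpy):

```python
import numpy as np

H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)  # the Hadamard gate

zero = np.array([1.0, 0.0])  # a QuBit firmly in state 0

superposed = H @ zero
print(superposed)       # equal amplitudes on 0 and 1 (each 1/sqrt(2))
print(superposed ** 2)  # [0.5, 0.5]: equal measurement probabilities
```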
In effect, the Hadamard Gate is the Schrodinger's Cat applied to bits: it makes the bits both dead and alive...
So, if we come back to the 3D model above, there is only one elevated state: the bottom left corner, which represents the initial state of 0.
But now, we're going to "Step into" the Hadamard Gate tutorial code, and run the first gate operation, on bit 0. Suddenly, the first bit has two simultaneous possible values: 0 and 1.
We see that the 3D model now looks like this:
Which is essentially 2 bars on the bottom left corner, representing the states 0 and 1, respectively. If we keep stepping further into the code, the second QuBit is "Schrodingered" and now we see that the 3D model displays an elevation on all of the values from 0 to 3, inclusively. When we Hadamard the 3rd QuBit, it's all values from 0 to 7 (because 2^3 = 8 possible states, in case that wasn't obvious). The "for loop" proceeds to Hadamard all 8 bits, and soon enough the entire 3D model gets elevated, forming a cube (this indicates that all 256 values are possible, or simultaneously true).
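That "cube" can be reproduced with a small state-vector simulation. This is a minimal sketch (assuming numpy; the apply_gate helper is my own construction, not the website's API):

```python
import numpy as np

n = 8                       # 8 QuBits -> 2^8 = 256 basis states
state = np.zeros(2 ** n)
state[0] = 1.0              # the register starts firmly at 0

H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)  # Hadamard gate

def apply_gate(state, gate, target, n):
    """Apply a 2x2 gate to qubit `target` (qubit 0 = least significant bit)."""
    ax = n - 1 - target                          # that qubit's axis after reshaping
    psi = state.reshape([2] * n)
    psi = np.tensordot(gate, psi, axes=(1, ax))  # contract the gate with that axis
    psi = np.moveaxis(psi, 0, ax)                # move the axis back into place
    return psi.reshape(-1)

# Hadamard every qubit, like the tutorial's for loop
for q in range(n):
    state = apply_gate(state, H, q, n)

probs = state ** 2
print(probs.min(), probs.max())  # both 1/256: all 256 states equally possible
```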
As we step through the next loop (the one that will reapply Hadamard on each QuBit, in the same order as before) -- we'll discover another property of the Hadamard gate.
So let's step into the next loop, and re-Hadamard the first bit. Now, the 3D models seems to be sliced up:
The reason for this is simple, though possibly not too intuitive at first: re-applying the Hadamard gate on a QuBit will reset it to its original value. So what this means is that we reset the first QuBit to 0, thus removing all odd numbered values from the possible quantum states. This is the reason for the sliced up 3D model. But, why does it reset the bit?
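The slicing can be checked numerically too. A minimal sketch (assuming numpy), starting from the uniform 256-state superposition and re-applying Hadamard to QuBit 0 alone:

```python
import numpy as np

H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)

# Uniform superposition over all 256 states: every amplitude is 1/sqrt(256) = 1/16
state = np.full(256, 1.0 / 16.0)

# With QuBit 0 as the last tensor factor, "H on QuBit 0 only" is I_128 (x) H
op = np.kron(np.eye(128), H)
state = op @ state

print(np.allclose(state[1::2], 0.0))  # True: every odd-numbered state is gone
```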
One helpful way to think about the Hadamard gate is as a rotation within a spherical model of the QuBit called a Bloch Sphere:
Witness my amazing MS Paint skills. In this sphere, you have the x and y axes around the "equator", and the z axis that goes vertically. The z axis value is essentially what decides the state of the QuBit. If the state is at the top of the sphere (z value = 1, x and y values = 0), then the QuBit's value is 1. If the state is at the bottom of the sphere, the value is 0. What can happen, however, is that the state can be somewhere along the equator of the Bloch sphere, which represents the state of superposition (both 0 and 1) that the Hadamard gate can put it in.
Visually, if we start from a QuBit value of 1 (top of sphere) and we apply the Hadamard Gate operator, what we are doing is a rotation of 180 degrees around the diagonal axis halfway between the x and z axes. Or, an equivalent operation: first rotate by 90 degrees about the Y axis, then rotate by 180 degrees about the X axis.
Visually, we get the following:
The state is now at the equator, which means that the value is simultaneously 0 and 1. If we apply the Hadamard gate again, we get back to the original state.
Similarly, if we started at value 0, and applied the Hadamard gate twice, we would be back at 0.
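Both of those claims can be checked with matrices. A minimal sketch (assuming numpy; up to a global phase, the Pauli X matrix plays the role of the 180-degree X rotation):

```python
import numpy as np

H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)

# 90-degree rotation about the Y axis
theta = np.pi / 2
Ry = np.array([[np.cos(theta / 2), -np.sin(theta / 2)],
               [np.sin(theta / 2),  np.cos(theta / 2)]])
X = np.array([[0.0, 1.0], [1.0, 0.0]])  # 180-degree X rotation, up to global phase

print(np.allclose(X @ Ry, H))         # True: the two-step recipe reproduces H
print(np.allclose(H @ H, np.eye(2)))  # True: applying H twice undoes it
```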
Hopefully this will save some time for someone curious about Quantum Computing. If you're interested in the other quantum gates, stay tuned, who knows..