Creation date: 2015-05-18
How can a machine recognize digits and what are the obstacles in doing so?
You and I can recognize digits at a single glance - well, most of them anyway - but can machines do this as well?
How can a webmaster check whether a human or a computer wants to login or make a comment on a website?
Pretty sure you all know the answer: captchas! Some of them are hard for us to recognize, and they are incredibly difficult for machines to identify. Google sometimes shows two captchas: one strange, hardly readable one, and a house number which is readable. The fact is that you can write whatever you like for the house number, because Google doesn't know the answer to that one; instead it will use your answer to improve Google Maps. The images were taken by Google Street View cars, but Google wasn't able to read the numbers, so it uses your brain for that task. Kind of creepy, but probably an effective way to make Google Maps better.
You might be thinking: Wouldn’t it be easier and less error-prone for Google to have a program that identifies the house numbers instead of making Google users do the job? And that is exactly where ML comes into play!
In this article you will learn the basics of machine learning. We will start with some simple neural nets.
The first goal of this blog series is to teach a machine to recognize handwritten digits, and we need a few tools to get there.
What is a neural net and why can it be helpful for accomplishing this task? How does it work, and how can we code it in a simple and fast way?
Neural nets try to simulate human thinking in a very elementary way. Let's have a look at a single neuron:
<small>Source: Illustration by: Quasar Jarosz (Wikimedia) License: CC BY-SA 3.0<br> I've added the legend inside the image. </small>
Neuron | Interpretation |
---|---|
Dendrites | Input vector |
Soma | Processes the input vector |
Axon | Output |
The simplest implementation of a neuron is the perceptron. It isn't state of the art, but I think it's the best place to start if you're new to ML.
Here \(x_1,x_2,x_3\) are the inputs, which can be represented as a vector. In general the number of inputs doesn't matter. The circle in the middle is the soma. To calculate the single binary output, each arrow from an input to the soma has a weight. The inputs themselves take a value of either one or zero.
To compute the output the perceptron uses this formula:
\[ output = \begin{cases} 1 & \text{if } \sum_i w_ix_i > \text{ threshold} \\ 0 & \text{else} \end{cases} \]So each input is multiplied by its weight, and the sum is compared with a threshold. If the sum is less than or equal to the threshold, the output is 0; if it is greater than the threshold, the output is 1.
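If you prefer code to formulas, here is a minimal Python sketch of that rule (the function name and layout are my own choice, not taken from any library):

```python
def perceptron_output(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0
```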
Is it possible to use a perceptron to compute the logical "and" and "or"?
We need two inputs \(x_1,x_2\), the weights \(w_1,w_2\) and a threshold.
\(x_1\) | \(x_2\) | \(x_1 \vee x_2\) | \(x_1 \wedge x_2\) |
---|---|---|---|
0 | 0 | 0 | 0 |
0 | 1 | 1 | 0 |
1 | 0 | 1 | 0 |
1 | 1 | 1 | 1 |
You might want to think for a while about how to set the weights and the threshold to solve this problem before reading on.
\(x_1 \vee x_2\) should fire (set the output to 1) if at least one of the inputs is true; otherwise it shouldn't fire. Let's say all weights are 0 and the threshold is 0 as well. If we set both \(x_1,x_2\) to 0, the sum is 0, which is \(\leq 0\), so the neuron wouldn't fire. That's fine, but because the weights are 0 the output will always be 0, no matter what the inputs are. Both "and" and "or" are commutative, so it makes sense to set \(w_1 = w_2\).
Try \(w_1=w_2=1\)
\(x_1\) | \(x_2\) | \(\sum_{i=1}^2 w_ix_i\) |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 2 |
That looks good! We can set the threshold to 0 and it works! It's much the same for the "and" function: we can use \(w_1=w_2=1\) and a threshold of 1. It might feel a bit fragile that for the inputs \((0,1)\) and \((1,0)\) the sum is exactly the threshold, but since the definition only fires when the sum is strictly greater than the threshold, it works properly.
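To double-check, here is a small self-contained sketch (the naming is mine) that prints the truth tables for exactly these weights and thresholds:

```python
def perceptron_output(inputs, weights, threshold):
    # Fire (output 1) only if the weighted sum is strictly greater than the threshold.
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > threshold else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        logical_or = perceptron_output((x1, x2), (1, 1), threshold=0)
        logical_and = perceptron_output((x1, x2), (1, 1), threshold=1)
        print(x1, x2, "or:", logical_or, "and:", logical_and)
```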
Can we learn an xor as well?
\(x_1\) | \(x_2\) | \(x_1 \veebar x_2\) |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |
Think a minute about how to solve this using a perceptron.
\[ w_1 \cdot 0 + w_2 \cdot 0 - \text{threshold} \leq 0 \\ w_1 \cdot 0 + w_2 \cdot 1 - \text{threshold} > 0 \\ w_1 \cdot 1 + w_2 \cdot 0 - \text{threshold} > 0 \\ w_1 \cdot 1 + w_2 \cdot 1 - \text{threshold} \leq 0 \\ \]The "xor" is commutative as well, so we might want to strike out one of the middle lines and set \(w=w_1=w_2\).
\[ 0w - \text{threshold} \leq 0 \\ 1w - \text{threshold} > 0 \\ 2w - \text{threshold} \leq 0 \\ \]Let's set \(t = \text{threshold}\)
\[ \begin{aligned} 0 & \leq t \\ w & > t \\ 2w & \leq t \end{aligned} \] \(w > t \underset{t \geq 0}{\implies} w > 0 \underset{w>t}{\implies} 2w > t \quad \Rightarrow\!\Leftarrow\)So it isn't possible to learn "xor" using a single perceptron. We can simply combine several perceptrons to solve this problem.
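The algebra above is the real argument, but you can also make it plausible with a tiny brute-force sketch. This is only an illustration on a coarse grid, not a proof, and all the names are mine:

```python
def perceptron_output(inputs, weights, threshold):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) > threshold else 0

xor_table = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Scan a coarse grid of weights and thresholds between -5 and 5 in steps of 0.5.
grid = [i * 0.5 for i in range(-10, 11)]
solutions = [
    (w1, w2, t)
    for w1 in grid
    for w2 in grid
    for t in grid
    if all(perceptron_output(x, (w1, w2), t) == y for x, y in xor_table.items())
]
print(solutions)  # prints [] -- no combination on the grid reproduces the xor table
```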
\(x_1 \veebar x_2 = \neg (x_1 \wedge x_2) \wedge (x_1 \vee x_2)\) We already have perceptrons for the "and" in the middle and the "or" at the end. The next thing we need is a "nand".
As you can see, I've changed the style of the graph. You can now see the weights and the threshold directly.
The number inside the circle is the bias \(b\).
\(b := -\text{threshold}\)Using the bias \(b\), the output function looks like this:
\[ output = \begin{cases} 1 & \text{if } \sum_i w_ix_i + b > 0 \\ 0 & \text{else} \end{cases} \]This has the advantage that we have a \(0\) on the right, and we can imagine an extra input \(x_0\) which always fires (\(x_0 = 1\)) and has the weight \(w_0 = b\), so it is possible to integrate the bias into the sum.
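Expressed in code, a minimal sketch of the bias formulation might look like this (again, the naming is my own); the second function shows the trick of treating the bias as an extra weight \(w_0\) on a constant input \(x_0 = 1\):

```python
def perceptron_with_bias(inputs, weights, bias):
    # Fire if the weighted sum plus the bias is strictly positive.
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def perceptron_bias_as_weight(inputs, weights, bias):
    # Same computation, but the bias is folded into the sum as weight w_0
    # on an extra input x_0 that is always 1.
    extended_inputs = (1,) + tuple(inputs)
    extended_weights = (bias,) + tuple(weights)
    return 1 if sum(w * x for w, x in zip(extended_weights, extended_inputs)) > 0 else 0

print(perceptron_with_bias((1, 1), (1, 1), bias=-1),
      perceptron_bias_as_weight((1, 1), (1, 1), bias=-1))  # both print 1
```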
I said I wouldn't use a hidden layer in the first article, so I'm sorry to use one here, but I hope it's clear how to combine some perceptrons. It isn't hard to compute the output; what's hard is coding how to learn a hidden layer, so I'll explain that in a future article. BTW, it's called a hidden layer because it's neither the input nor the output, so it's kind of hidden.
Back to the "xor" graph:
The top part of that graph represents the "nand", the bottom part the "or" and the part on the right is the "and".
\(x_1\) | \(x_2\) | \(n = -2x_1+(-2)x_2+3 > 0\) | \(o = x_1+x_2+0 > 0\) | \(n+o-1 > 0\) |
---|---|---|---|---|
0 | 0 | 1 | 0 | 0 |
0 | 1 | 1 | 1 | 1 |
1 | 0 | 1 | 1 | 1 |
1 | 1 | 0 | 1 | 0 |
I used a \(1\) to indicate that the inequality in the column header is satisfied.
The last column represents the output of the "xor".
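Here is a small sketch of that three-neuron network in Python, using the weights and biases from the graph (the function names are mine):

```python
def neuron(inputs, weights, bias):
    # A perceptron with a bias: fire if the weighted sum plus the bias is positive.
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def xor(x1, x2):
    nand = neuron((x1, x2), (-2, -2), bias=3)        # top neuron of the hidden layer
    logical_or = neuron((x1, x2), (1, 1), bias=0)    # bottom neuron of the hidden layer
    return neuron((nand, logical_or), (1, 1), bias=-1)  # output neuron: the "and"

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor(x1, x2))  # prints 0, 1, 1, 0
```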
The next step will be to think about how to learn the weights and the threshold. I said at the beginning of the section about perceptrons that they aren't state of the art, but why?
One of the most important things about learning is that a small change in a weight should only change the output by a small amount.
\(x_1\) | \(x_2\) | \(n = -2x_1+(-2)x_2+3 > 0\) | \(o = x_1+x_2+0.1 > 0\) | \(n+o-1 > 0\) |
---|---|---|---|---|
0 | 0 | 1 | 1 | 1 |
0 | 1 | 1 | 1 | 1 |
1 | 0 | 1 | 1 | 1 |
1 | 1 | 0 | 1 | 0 |
I've changed the "or" bias from \(0\) to \(0.1\), and that tiny change flipped the overall output from an "xor" to a "nand".
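You can reproduce this flip with the same kind of sketch as above by changing only the "or" bias (again, all names are mine):

```python
def neuron(inputs, weights, bias):
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0

def network(x1, x2, or_bias):
    nand = neuron((x1, x2), (-2, -2), bias=3)
    logical_or = neuron((x1, x2), (1, 1), bias=or_bias)
    return neuron((nand, logical_or), (1, 1), bias=-1)

for x1 in (0, 1):
    for x2 in (0, 1):
        # or_bias=0 gives xor (0, 1, 1, 0); or_bias=0.1 gives nand (1, 1, 1, 0).
        print(x1, x2, network(x1, x2, 0), network(x1, x2, 0.1))
```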
A sigmoid neuron supports non-binary inputs and outputs. That's pretty much the whole change.
To define a sigmoid neuron we need to have a look at the sigmoid function first.
\[ \sigma(x) = \frac{1}{1+e^{-x}} \]We need a function that has these two properties:
- It has to be smooth
- The values should lie between 0 and 1

The smoothness is what achieves our goal that a small change in a weight or bias only changes the output a little, and keeping the values between 0 and 1 gives us a roughly binary output, which is desirable. When training on handwritten digits we will use 10 outputs: each one represents one digit, and its value shows how likely it is that the handwritten digit is that digit.
There are several different functions with these properties, but we will use the sigmoid function because it is very common.
To compute the output of a sigmoid neuron the function looks like this:
\[ output = \sigma\big(\sum_i w_ix_i+b\big) \]Now if we change a weight or a bias a bit, the output will change only a bit as well. It's possible to prove this, but I don't think the proof is important here.
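Here is a minimal sketch of a sigmoid neuron (my own naming, using only the standard library) together with a quick check that a small weight change only shifts the output a little:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(inputs, weights, bias):
    # Output is the sigmoid of the weighted sum plus the bias: a value between 0 and 1.
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

before = sigmoid_neuron((1, 0), (1.0, 1.0), bias=0.0)
after = sigmoid_neuron((1, 0), (1.1, 1.0), bias=0.0)  # nudge one weight by 0.1
print(before, after)  # roughly 0.731 vs 0.750 -- a small change in, a small change out
```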
You might have noticed that if we use a sigmoid neuron to compute a logical function like "xor", the output will never be exactly 0 or 1. But that isn't a problem: we can simply round the output of the output layer. A sigmoid neuron with a rounded output behaves just like a perceptron, but we need the non-binary output to train the neuron efficiently.
In the next blog entry I'll explain how to learn some small things using a single sigmoid neuron. There will be a code part written in Python, the most readable programming language that I know.
Michael Nielsen explains how to learn handwritten digits using a neural net here: neuralnetworksanddeeplearning.com/chap1.html
I've read it, and he explains neural nets in more detail, but it is correspondingly more complex and harder to read. It might be interesting for some of you.