How do we make a perceptron learn to perform certain tasks? In this chapter, we are going behind the scenes of the “Perceptron Learning Institute”. Keep scrolling!
In the last chapter, we learned that a perceptron can take in some inputs and spit out an output. How can we make it learn to perform tasks? To understand the learning process, we will first discuss what the perceptron is abstractly.
At a high level, the perceptron can be seen as a mathematical function that takes in some inputs, some weights, and a bias, and mixes them together to generate an output.
Since we have no control over the inputs (we cannot change the data that we are learning from), we will try to change the weights and bias so that the perceptron function produces an output as close to ideal as possible for every input we feed into it.
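To make the "function" view concrete, here is a minimal sketch in Python. It assumes the mixing step is a weighted sum of the inputs plus the bias, passed through a sigmoid so the output lands between 0 and 1. The sigmoid is an assumption made for illustration, and the names perceptron, inputs, weights, and bias are hypothetical.

```python
import math

def perceptron(inputs, weights, bias):
    """Mix inputs, weights, and bias into a single output in (0, 1)."""
    # Weighted sum: each input is scaled by its weight, then the bias is added.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid activation (an assumption here) squashes z into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values only: two inputs, two weights, one bias.
output = perceptron([1.0, 0.0], [0.5, -0.3], 0.1)
print(output)
```

Calling the function with the same inputs but different weights or bias gives a different output, which is exactly the lever that training pulls.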
The core of “training” a perceptron is to let it know how far its predictions are from the ground truth (the desired results). To achieve this, we use a loss function to measure the performance of our perceptron. A commonly used loss function is called cross entropy, which is well suited to classification tasks such as classifying images. An ideal training algorithm modifies the weights and bias so that the loss decreases.
Here, the loss is visualized as the difference between the output (triangle) and the desired ground truth output (rectangle). Training the perceptron is the process of decreasing this difference. Once the loss is decreased to its minimum, the perceptron completes its learning.
There are many loss functions, but all share a common behavior: the further off the prediction is from the ground truth, the larger the output of the loss function is. Minimizing the loss by changing the weights and biases means our perceptron will gradually improve until it reaches its limit.
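As a rough illustration of that shared behavior, the sketch below implements cross entropy for a single yes/no prediction (the binary form). The function name binary_cross_entropy and the small eps clamp are assumptions added for this example. The clamp only keeps the logarithm away from zero.

```python
import math

def binary_cross_entropy(prediction, target, eps=1e-12):
    """Loss for one prediction in (0, 1) against a target of 0 or 1."""
    # Clamp the prediction so math.log never receives exactly 0.
    p = min(max(prediction, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# The ground truth is 1 in both cases; only the prediction changes.
close = binary_cross_entropy(0.9, 1.0)  # prediction near the ground truth
far = binary_cross_entropy(0.1, 1.0)    # prediction far from the ground truth
print(close, far)
```

The prediction of 0.1 produces a much larger loss than the prediction of 0.9, so lowering the loss pushes the perceptron's output toward the ground truth.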
To minimize loss, we use an optimizer called Gradient Descent. The "hill" here depicts the relationship between the weights and bias of a perceptron and its loss. The higher up on the hill we are, the greater the loss is. Gradient Descent is a technique that modifies the weights and biases to walk “downhill”, and the optimizer finds the steepest way downwards.
First, the Gradient Descent optimizer looks at the "hill" with respect to the data and computes gradients, which provide a direction downhill. Then, the optimizer uses a learning rate to decide how far to walk before stopping and recomputing the gradients. After many repetitions, it slips its way down to a place where it can no longer go downhill (a minimum of the loss).
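The loop described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes a sigmoid perceptron trained with cross entropy, a tiny made-up dataset (the logical OR of two inputs), and a hand-picked learning rate of 0.5 and 1000 steps. All function and variable names are hypothetical.

```python
import math

def predict(x, w, b):
    """Sigmoid perceptron: weighted sum plus bias, squashed into (0, 1)."""
    z = sum(xi * wi for xi, wi in zip(x, w)) + b
    return 1.0 / (1.0 + math.exp(-z))

def mean_loss(data, w, b):
    """Average cross entropy over the dataset (the height on the "hill")."""
    total = 0.0
    for x, y in data:
        p = predict(x, w, b)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(data)

def gradient_descent(data, w, b, learning_rate=0.5, steps=1000):
    for _ in range(steps):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in data:
            # For sigmoid + cross entropy, the gradient of the loss with
            # respect to the weighted sum is simply (prediction - target).
            error = predict(x, w, b) - y
            for i, xi in enumerate(x):
                grad_w[i] += error * xi / len(data)
            grad_b += error / len(data)
        # Step against the gradient; the learning rate sets the step size.
        w = [wi - learning_rate * gi for wi, gi in zip(w, grad_w)]
        b = b - learning_rate * grad_b
    return w, b

# Made-up training data: logical OR of two inputs.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
w0, b0 = [0.0, 0.0], 0.0
loss_before = mean_loss(data, w0, b0)
w, b = gradient_descent(data, w0, b0)
loss_after = mean_loss(data, w, b)
print(loss_before, loss_after)
```

Each pass computes the gradients from the data and then takes one step downhill. A larger learning rate takes bigger steps but can overshoot the bottom, while a smaller one is steadier but needs more repetitions.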
By now, we have learned what it means for the perceptron to learn, and how we can train the perceptron. Let us venture into more complex neural network structures. We will discuss how we can adapt perceptrons to a type of multilayer network called the feedforward neural network.
Loss Function: A function that determines the loss, the value that represents how far the output predictions are from the ground truth.
Gradient: A group of numbers indicating the direction in which the loss rises most steeply; stepping in the opposite direction is the “steepest” way to reduce loss (downhill).
Gradient Descent: An optimizer that uses gradients to adjust the weights and biases to minimize loss.