Chapter 4

How do we make a perceptron learn to perform certain tasks? In this chapter, we are going behind the scenes of the “Perceptron Learning Institute”. Keep scrolling!


In the last chapter, we learned that a perceptron can take in some inputs and spit out an output. How can we make it learn to perform tasks? To understand the learning process, we will first discuss what the perceptron is abstractly.

At a high level, the perceptron can be seen as a mathematical function that takes in some inputs, some weights, and some biases, and mixes them together to generate an output.
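This mixing can be sketched in a few lines of code. Below is a minimal, illustrative perceptron function (the names and numbers are made up for the example): it takes a weighted sum of the inputs, adds the bias, and passes the result through a step activation.

```python
# A minimal sketch of a perceptron as a function (illustrative only).
def perceptron(inputs, weights, bias):
    # Mix the inputs and weights into a weighted sum, then add the bias.
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step activation: output 1 if the sum is positive, otherwise 0.
    return 1 if total > 0 else 0

# 1.0 * 0.6 + 0.5 * (-0.4) + 0.1 = 0.5, which is positive, so the output is 1.
print(perceptron([1.0, 0.5], [0.6, -0.4], 0.1))  # → 1
```

Notice that the inputs are given to the function; the only knobs we can turn are the weights and the bias.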

Since we have no control over the inputs (we cannot change the data we are learning from), we instead adjust the weights and biases so that the perceptron function produces the desired output for every input we feed into it.

The core of “training” a perceptron is letting it understand how far its predictions are from the ground truth (the desired results). To achieve this, we use a **loss function** to measure the performance of our perceptron. A commonly used loss function is cross entropy, which works well for tasks like classifying images. An ideal training algorithm modifies the weights and biases so that the loss decreases.

Here, the loss is visualized as the difference between the output (triangle) and the desired ground truth output (rectangle). Training the perceptron is the process of shrinking this difference. Once the loss is decreased to its **minimum**, the perceptron has completed its learning.

There are many loss functions, but they all share a common behavior: the further the prediction is from the ground truth, the larger the loss. Minimizing the loss by changing the weights and biases means our perceptron will gradually improve until it reaches its limit.
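We can see this behavior with binary cross entropy, the loss mentioned above (the numbers here are just for illustration). A prediction close to the ground truth yields a small loss; a prediction far from it yields a much larger one.

```python
import math

# Binary cross entropy: a common loss for classification tasks.
# `truth` is the ground-truth label (0 or 1); `prediction` is the
# model's estimated probability that the label is 1.
def cross_entropy(prediction, truth):
    return -(truth * math.log(prediction)
             + (1 - truth) * math.log(1 - prediction))

# The further off the prediction, the larger the loss:
print(cross_entropy(0.9, 1))  # close to the truth → small loss (about 0.105)
print(cross_entropy(0.1, 1))  # far from the truth → large loss (about 2.303)
```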

To minimize the loss, we use an **optimizer** called **Gradient Descent**. The "hill" here depicts the **relationship** between the weights and biases of a perceptron and its loss: the higher up the hill we are, the greater the loss. Gradient Descent modifies the weights and biases to walk “downhill”, and the optimizer finds the steepest way down.

First, the Gradient Descent optimizer looks at the "hill" with respect to the data and generates **gradients**, which provide a direction downhill. Then, the optimizer uses a learning rate to decide how far to walk before stopping and recomputing the gradients. After many repetitions, it works its way down to a place where it can no longer go downhill (the minimum loss).
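These repeated downhill steps can be sketched with a toy one-weight example. Here the "hill" is an assumed loss L(w) = (w − 3)², whose lowest point sits at w = 3; the gradient tells us the slope, and we step against it.

```python
# A tiny gradient-descent loop on a toy loss (illustrative assumptions).
# Loss: L(w) = (w - 3)^2, whose minimum sits at w = 3.
# Gradient: dL/dw = 2 * (w - 3), which points "uphill".
def gradient(w):
    return 2 * (w - 3)

w = 0.0               # start somewhere on the hill
learning_rate = 0.1   # how far to walk before re-checking the slope

for _ in range(100):
    w -= learning_rate * gradient(w)  # step in the downhill direction

print(round(w, 4))  # → 3.0, the bottom of the hill
```

If the learning rate were much larger, each step could overshoot the bottom and bounce around instead of settling, which is why choosing it well matters.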

By now, we have learned what it means for a perceptron to learn, and how we can train it. Let us venture into more complex neural network structures. We will discuss how we can adapt perceptrons into a type of multilayer network called the feedforward neural network.

Gradient descent, how neural networks learn

Gradient Descent in a Nutshell

**Loss Function**: A function that determines the loss, the value representing how far the output predictions are from the ground truth.

**Gradient**: A group of numbers indicating which direction is the “steepest” way to minimize the loss (downhill).

**Gradient Descent**: An optimizer that uses gradients to adjust the weights and biases to minimize the loss.