Deep learning (DL) is a tool we can apply to solve problems computationally. Its inner workings are inspired by the neurons in our brains, which it replicates in a simplified way.

In almost any scenario where you can identify a pattern, DL can probably be trained to do the same thing, from simple tasks like separating images of cats and dogs to predicting the risk of cancer in medical imaging.

DL has a strength which sets it apart from other machine learning methods: it identifies features autonomously. This means that when we write code to distinguish cats from dogs, we don't need to specify the shape of a cat's eyes or any other specific feature. We just train our model on a variety of images of cats and dogs in all different conditions. DL will extract that feature from the training data, along with many other distinguishing features, depending on how you build your model.

So far I’ve only mentioned images, but DL also works well on text (natural language processing) and audio.

How It Works

A Neuron & Neural Networks

One method to extract complex patterns from data is a Neural Network (NN), which can be visualised in the image below. We read this from left to right: each column of circles represents a layer of neurons, or nodes (the circles). Data comes into the input layer, then each node in the hidden layers transforms the data to reach some desired output.

A simple neural network with 1 hidden layer. Each circle represents a node, arrows indicate data flow.
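To make the data flow concrete, here is a minimal sketch of a forward pass through a network like the one above, written with NumPy (an assumption; the text names no library). The layer sizes and input values are arbitrary choices for illustration.

```python
import numpy as np

# Forward pass through a tiny network: 2 inputs, one hidden
# layer of 3 nodes, 1 output node.
rng = np.random.default_rng(0)

x = np.array([0.5, -1.2])        # input layer (2 features)
W1 = rng.normal(size=(3, 2))     # hidden layer weights
b1 = np.zeros(3)                 # hidden layer biases
W2 = rng.normal(size=(1, 3))     # output layer weights
b2 = np.zeros(1)                 # output layer bias

hidden = W1 @ x + b1             # each hidden node computes a weighted sum plus bias
output = W2 @ hidden + b2        # the output node does the same with the hidden values
print(output)
```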

Let’s take a closer look at the nodes, as shown in the image below. These are pretty simple, taking the form y = wx + b, where x is the input, w is a weight, b is a bias and y is the output. For a node which takes n inputs, we have n x’s and n w’s, so y = w1x1 + w2x2 + … + wnxn + b. The only adjustable variables in a node are the weights and the bias, both fine-tuned to get closer to the target output. But even when you combine multiple nodes of the form y = wx + b, the result is still a single straight line, which is not much use compared to a simple line of best fit, as we are limited to linear patterns.
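You can check this collapse directly: composing two linear nodes is algebraically just another line. A small sketch (NumPy, arbitrary numbers):

```python
import numpy as np

# Two stacked linear nodes: y = w2 * (w1 * x + b1) + b2
w1, b1 = 0.8, 0.3
w2, b2 = -1.5, 0.7

x = np.linspace(-2, 2, 5)
y = w2 * (w1 * x + b1) + b2

# The composition collapses into a single line:
# y = (w2*w1) * x + (w2*b1 + b2)
w_eff, b_eff = w2 * w1, w2 * b1 + b2
print(np.allclose(y, w_eff * x + b_eff))  # True: still just one straight line
```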

To improve this we can model non-linear patterns by applying an activation function (AF) to each node’s output, indicated by the key in the diagram below. The AF is just a mapping which decides how significant the output of the node will be. Within a NN each layer uses the same AF, and often the whole NN uses the same AF.
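As a quick sketch, here is the same weighted-sum-plus-bias node with two common AFs applied to its output (again in NumPy; the input values and weights are made up for illustration):

```python
import numpy as np

def relu(z):
    # ReLU: passes positive values through, zeroes out negatives
    return np.maximum(0.0, z)

# np.tanh is the other common choice: it squashes values smoothly into (-1, 1)

x = np.array([0.5, -1.2])   # two inputs, as in the node diagram
w = np.array([0.9, -0.4])   # one weight per input
b = 0.1                     # bias shifts the weighted sum

z = w @ x + b               # the node's linear part
print(relu(z), np.tanh(z))  # the same node under two different AFs
```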

The NN learns by adjusting the weights and bias of each node, to separate two groups for classification or to output a numeric value for regression.

The inner workings of a node. Starting with the two inputs, x1 and x2, which are multiplied by their corresponding weights, then summed together at the next stage, +. After that the bias, b, shifts the summed value, which is then passed to the activation function (key); its output is the neuron’s output.
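The text above doesn’t spell out how the adjusting happens; gradient descent is the standard method, so here is a hedged sketch of a single update step for one linear node with a squared-error loss (all numbers are made up):

```python
# One gradient-descent step for a single linear node y = w*x + b.
w, b = 0.0, 0.0
lr = 0.1                 # learning rate (arbitrary choice)

x, target = 2.0, 5.0     # one training example
y = w * x + b            # forward pass
error = y - target       # how far off we are

# Gradients of the loss 0.5 * (y - target)**2 with respect to w and b
grad_w = error * x
grad_b = error

w -= lr * grad_w         # nudge the weight toward the target
b -= lr * grad_b         # nudge the bias toward the target
print(w, b)              # weight and bias after one update
```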

Deep Neural Network

To go from a simple NN to a deep neural network (DNN), we need 2 or more hidden layers, as shown in the diagram below. This is one of the simplest NNs capable of deep learning.

A simple DNN, used for deep learning. With 2 hidden layers.
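A 2-hidden-layer DNN like this can be built in a few lines. Here is a minimal sketch using scikit-learn’s MLPClassifier (an assumption; the source names no library), trained on XOR, a classic toy problem that no single straight line can separate:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# XOR: the two classes cannot be separated by one straight line
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

model = MLPClassifier(hidden_layer_sizes=(4, 4),  # 2 hidden layers of 4 nodes
                      activation="relu",
                      solver="lbfgs",             # works well on tiny datasets
                      random_state=0)
model.fit(X, y)
print(model.predict(X))  # should recover [0, 1, 1, 0], though results vary with the seed
```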

To visualise what happens inside a DNN, the image below shows the same DNN as above, applied to a simple dataset with 2 distinct clusters of data points. For such a simple distribution, a single straight line separating the clusters is the optimal solution.

Graphic representation of a DNN. The blue and green circles are data points used to train the model (the network together with its learned weights and biases). The coloured regions represent what the model has learned.

Using the same DNN structure as above, this time applied to a more complex doughnut-shaped distribution, shown below. With the linear option, which is the same as no AF, the DNN can only find straight lines, so we get no solution that separates blue from orange. The ReLU AF does find a solution which works well; notice how the shape is made up of straight lines. This reflects the nature of the ReLU AF, which is pieced together from 2 linear functions. And with the Tanh AF, which is a smooth non-linear function, we can see a curved shape.

The results from training DNN with different activation functions. Linear is the same as no activation function.
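You can reproduce this comparison on a doughnut-shaped dataset with scikit-learn (a sketch under the same assumptions as before; make_circles stands in for the distribution in the figure, and 'identity' is scikit-learn’s name for the linear option):

```python
from sklearn.datasets import make_circles
from sklearn.neural_network import MLPClassifier

# Doughnut-shaped data: one class inside, one in a ring around it
X, y = make_circles(n_samples=400, noise=0.1, factor=0.4, random_state=0)

# Same small network, three different activation functions
for af in ["identity", "relu", "tanh"]:
    model = MLPClassifier(hidden_layer_sizes=(4, 4), activation=af,
                          max_iter=5000, random_state=0)
    model.fit(X, y)
    # linear should score near 0.5 (chance), the non-linear AFs near 1.0
    print(af, model.score(X, y))
```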

Now look at a very complex data distribution, a ‘Swirl’. Here our DNN can’t find a solution with any of the AFs: such a complex distribution requires a more complex DNN, with more nodes and layers.

A simple DNN applied to a complex distribution

The DNN below has significantly more nodes and layers, which is sufficient to find the complex swirl pattern. But using this many nodes also requires a lot more computing power. This particular distribution required some experimentation: the Tanh AF could not find a solution, but the ReLU AF works.

A complex DNN applied to a complex distribution
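As a final sketch, here is one way to try this yourself: two interleaved spirals generated with NumPy stand in for the swirl (the exact dataset behind the figures isn’t specified), fed to a larger ReLU network.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Two interleaved spirals, a stand-in for the 'Swirl' distribution
rng = np.random.default_rng(0)
n = 500
theta = np.sqrt(rng.uniform(size=n)) * 3 * np.pi   # angle grows along the spiral
r = theta                                          # radius grows with the angle
spiral_a = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
spiral_b = -spiral_a                               # second spiral, rotated 180 degrees
X = np.vstack([spiral_a, spiral_b]) + rng.normal(scale=0.3, size=(2 * n, 2))
y = np.array([0] * n + [1] * n)

# A larger network: 3 hidden layers of 32 nodes each, ReLU activation
model = MLPClassifier(hidden_layer_sizes=(32, 32, 32), activation="relu",
                      max_iter=5000, random_state=0)
model.fit(X, y)
print(model.score(X, y))  # should be close to 1.0 with this much capacity
```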