Deep learning (DL) is a computational tool we can apply to solve problems. Its inner workings are loosely inspired by the neurons in our brains, replicated in a highly simplified way.
In almost any scenario where you can identify a pattern, DL can probably be trained to do the same thing, from simple tasks like separating images of cats and dogs to predicting the risk of cancer in medical imaging.
DL has a particular strength which sets it apart from other machine learning methods: it identifies features autonomously. This means that when we write code to tell cats from dogs, we don’t need to specify the shape of a cat’s eyes or any other specific feature. We just train our model on a variety of images of cats and dogs in all different conditions, and DL extracts these features, and many other distinguishing ones, from the training data, depending on how you build your model.
So far I’ve only mentioned images, but DL works well on text (natural language processing) and audio.
How It Works
A Neuron & Neural Networks
One method to extract complex patterns from data is a Neural Network (NN), which can be visualised in the image below. We read this from left to right: each column of circles represents a layer of neurons, or nodes (the circles). Data comes into the input layer, then each node in the hidden layers transforms the data to reach some desired output.
Let’s take a closer look at the nodes, as shown in the image below. These are pretty simple, taking the form y = wx + b, where x is the input, w is a weight, b is a bias and y is the output. For a node which takes n inputs, there are n values of x and n corresponding weights, so the output becomes y = w₁x₁ + w₂x₂ + … + wₙxₙ + b. The only adjustable variables in a node are the weights and the bias, both fine-tuned to bring the output closer to the target. But even when you combine multiple nodes of the form y = wx + b, the result is still a single straight line, which is not much more useful than a simple line of best fit: we are limited to linear patterns.
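To make this concrete, here is a minimal sketch of a single node in NumPy (the input, weight and bias values are made up for illustration):

```python
import numpy as np

def node(x, w, b):
    # A single node: the weighted sum of its inputs plus a bias.
    return np.dot(w, x) + b

x = np.array([0.5, -1.2, 3.0])   # three inputs
w = np.array([0.8, 0.1, -0.4])   # one weight per input (learned in training)
b = 0.25                         # bias (learned in training)

print(node(x, w, b))             # y = w·x + b -> a single number
```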
To improve on this, we can model non-linear patterns by applying an activation function (AF) to each node’s output, indicated by the key in the diagram below. The AF is just a mapping which decides how significant the output of the node will be. Within a NN, every node in a layer uses the same AF, and often the whole network uses the same one.
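As a rough sketch, the two AFs that appear below, ReLU and Tanh, can be written in a couple of lines each:

```python
import numpy as np

def relu(z):
    # ReLU: keeps positive values, zeroes out negatives.
    return np.maximum(0.0, z)

def tanh(z):
    # Tanh: smoothly squashes any input into the range (-1, 1).
    return np.tanh(z)

# A node with an AF applied to its output: y = AF(wx + b)
z = 0.8 * 0.5 + 0.25
print(relu(z), tanh(z))
```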
The NN learns by adjusting the weights and bias of each node, either to separate groups for classification or to output a numeric value for regression.
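As a minimal sketch of what ‘learning’ means, here is gradient descent adjusting the w and b of a single linear node to fit the made-up target y = 2x + 1:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 100)
y = 2 * x + 1                        # toy target the node should learn

w, b, lr = 0.0, 0.0, 0.1             # start with arbitrary weight and bias
for _ in range(200):
    error = (w * x + b) - y          # how far off the node currently is
    w -= lr * 2 * np.mean(error * x) # gradient of mean squared error w.r.t. w
    b -= lr * 2 * np.mean(error)     # gradient w.r.t. b

print(round(w, 2), round(b, 2))      # converges towards w = 2, b = 1
```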
Deep Neural Network
To go from a simple NN to a deep neural network (DNN), we need two or more hidden layers, as shown in the diagram below. This is one of the simplest networks that qualifies as deep learning.
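As a sketch, the diagram’s two-hidden-layer structure could be built with Keras like this (the layer sizes here are illustrative assumptions, not taken from the diagram):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(2,)),               # two input features
    layers.Dense(4, activation="relu"),     # hidden layer 1
    layers.Dense(4, activation="relu"),     # hidden layer 2
    layers.Dense(1, activation="sigmoid"),  # output: probability of a class
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```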
To visualise what happens inside a DNN, the image below shows the same DNN as above, applied to a simple dataset with two distinct clusters of data points. For such a simple distribution, the simple line of best fit is the optimal solution.
Using the same DNN structure as above, this time applied to the more complex doughnut-shaped distribution shown below: the linear option is equivalent to no AF, and because the DNN can then only draw straight lines, we get no solution that separates blue from orange. The ReLU AF does find a solution which works well; notice how the boundary is made up of straight segments. This reflects the nature of the ReLU AF, which behaves like two linear functions joined together. And with the Tanh AF, which is a non-linear function, we see a curved shape.
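A sketch of this doughnut experiment in code, using scikit-learn’s make_circles as a stand-in for the doughnut distribution (the layer sizes and training settings are assumptions):

```python
from sklearn.datasets import make_circles
from sklearn.neural_network import MLPClassifier

X, y = make_circles(n_samples=500, noise=0.05, factor=0.5, random_state=0)

for af in ["identity", "relu", "tanh"]:     # "identity" == the linear option
    clf = MLPClassifier(hidden_layer_sizes=(4, 4), activation=af,
                        max_iter=2000, random_state=0)
    clf.fit(X, y)
    # identity should stay near 0.5 (chance); relu and tanh near 1.0
    print(af, clf.score(X, y))
```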
Looking at a very complex data distribution, a ‘swirl’, we can see that our DNN can’t find a solution with any of the AFs. Such a complex distribution requires a more complex DNN, with more nodes and layers.
The DNN below has significantly more nodes and layers, which is sufficient to find the complex swirl pattern, but using this many nodes will also require a lot more computing power. This particular distribution required some experimentation, as the Tanh AF could not find a solution, but the ReLU AF works.
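A rough sketch of the swirl experiment: make_spirals below is a hypothetical helper (not a library function) that generates two interleaved spirals, and the network is made wider and deeper with ReLU:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def make_spirals(n=500, noise=0.3, seed=0):
    # Hypothetical helper: two interleaved spirals, one per class.
    rng = np.random.default_rng(seed)
    t = np.sqrt(rng.uniform(0.0, 1.0, n)) * 3 * np.pi
    spiral = np.stack([t * np.cos(t), t * np.sin(t)], axis=1)
    X = np.vstack([spiral, -spiral])       # second spiral rotated 180 degrees
    X += rng.normal(scale=noise, size=X.shape)
    y = np.hstack([np.zeros(n), np.ones(n)])
    return X, y

X, y = make_spirals()
clf = MLPClassifier(hidden_layer_sizes=(32, 32, 32), activation="relu",
                    max_iter=5000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))  # with enough capacity this should approach 1.0
```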