Neural networks from scratch in Python

In this post we will implement a simple neural network architecture from scratch using Python and Numpy. We will take a look at the mathematics behind a neural network, implement one in Python, and experiment with a number of datasets to see how they work in practice.

Architecture of a neural network

A neural network is made up of one input layer, one output layer, and one or more hidden layers. Each layer is made up of multiple nodes. The number of nodes in the input layer is equal to the number of features (i.e. data which we believe explains whatever process we are trying to model/explain) while the size of the output layer is usually equal to the number of classes to which our data points belong (the output layer can also consist of only one node when we are doing regression or binary classification).

The number of nodes in the hidden layer is a hyper parameter which we can pick and choose. How many nodes and how many hidden layers to pick is something of an art (pick based on previous experience, pick what works best across multiple configurations, etc). A network having multiple hidden layers is what's normally called a deep neural network. However, it's important to keep in mind that blindly adding multiple layers it's not guaranteed to improve performance.

The simple network pictured above maps four dimensional points (four features) to three possible classes (three output nodes).

The forward pass

You can think of a neural network as a function that maps arbitrary inputs to arbitrary outputs. One computation of this function is usually called a forward pass. The network takes in some input via the input layer, moves this data forward through the layers and arrives at an output. How this computation works is what we'll look at now.

We will work with a neural network with a single hidden layer much like the one pictured above.

Input -> Hidden

Each node in the hidden layer holds a value representing an arbitrary combination of the nodes in the previous layer. For the first node in the hidden layer we can define this value as:

$z_1 = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n$

where $$n$$ is the number of nodes in the input layer and $$x$$ the value of each input node. Note this is nothing but a simple linear combination of all input values. We can define this in general form for each hidden layer node $$k$$ as:

$z_k = w_{1,k} x_1 + w_{2,k} x_2 + \cdots w_{n,k} x_n$

We can define a matrix $$W_{n \times d}$$ and use it to store all our $$w_{i,j}$$ values as follows:

$W = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1, d} \\ w_{2,1} & w_{2,2} & \cdots & w_{2, d} \\ \vdots & \vdots & \vdots & \vdots \\ w_{n,1} & w_{n,2} & \cdots & w_{n, d} \end{pmatrix}$

The $$w_{i,j}$$ values are called weights. They represent the "strength" of the connection between node $$i$$ (node on the left) and node $$j$$ (node on the right). If you look at the neural network figure in the beginning each $$w$$ represents an arrow (going from node i to node j).

This weights matrix $$W$$ holds a column for each node in the hidden layer (i.e. we have $$d$$ nodes in the hidden layer) and a line for each node in the input layer (i.e. we have $$n$$ nodes in the input layer). We can use each column in the weights matrix to compute those $$z$$ values above. Defining the input layer as a column vector $$x$$ we can compute all those $$z$$ values at once as follows:

$z = \begin{pmatrix} z_{1} \\ z_{2} \\ \vdots \\ z_{d} \end{pmatrix} = W^T x = \begin{pmatrix} w_{1,1} & w_{2,1} & \cdots & w_{n, 1} \\ w_{1,2} & w_{2,2} & \cdots & w_{n, 2} \\ \vdots & \vdots & \vdots & \vdots \\ w_{1,d} & w_{2,d} & \cdots & w_{n, d} \end{pmatrix} \begin{pmatrix} x_{1} \\ x_{2} \\ \vdots \\ x_{n} \end{pmatrix}$

Note we take the transpose of the weights matrix because we would like to do a dot product between each column in the untouched weights matrix and the column vector $$x$$. Transposing $$W$$ allows us to express this as a simple matrix multiplication.

In Python you would do this operation as follows:

import numpy as np

x = np.random.randn(n, 1)
W = np.random.randn(n, d)

z = W.T.dot(x)


The bias term

Something not mentioned until now is the bias term. A bias value is just another number we add to our $$z$$ value above. If you're familiar with linear regression (or just lines in general) you might have seen lines expressed as:

$y = wx + b$

That $$b$$ value is the bias term we are now talking about. It allows a linear function to shift. Learning the $$w$$ parameter allows us to change the steepness (slope) of the line we are learning. Learning a bias term as well, allows us to also shift the function up or down and thus produce a better model for our data.

In our case we simply produce another column vector $$b$$ and add it to $$z$$.

import numpy as np

x = np.random.randn(n, 1)
W = np.random.randn(n, d)
b = np.random.randn(d)

z = W.T.dot(x) + b


Going from one sample to multiple samples

In the Python code above we fill x with some random values for demonstration purposes. x however is data you would observe in the real world. If for example, we are trying to model house prices, then this x would hold values representing features of a house which we believe explain its price. These could be size in square feet, year of construction, and anything else we can think of. Each such value would be one element in the vector x. Normally we would have multiple such observations so we would have multiple such x vectors.

In practice, x is going to be a matrix holding all of our observations. The column vector x above would therefore be a line in the X data matrix containing all of our observations. The computations are going to look a bit different. The column vector x, now becomes the following X matrix:

$X_{s \times n} = \begin{pmatrix} x_{1, 1} & x_{1, 2} & \cdots & x_{1, n} \\ x_{2, 1} & x_{2, 2} & \cdots & x_{2, n} \\ \vdots & \vdots & \vdots & \vdots \\ x_{s, 1} & x_{s, 2} & \cdots & x_{s, n} \end{pmatrix}$

Each line in this matrix is like the one column vector we had before. In practical terms, we have observed $$s$$ samples (e.g. houses) and for each sample we have measured $$n$$ features (e.g. house size, age, etc). We will now compute the $$z$$ values just like before but we will do it across all samples. In the end, $$z$$ will also become a matrix $$Z$$ rather than just another column vector.

$Z_{s \times d} = X_{s \times n} W_{n \times d}$

$$Z$$ will hold $$s$$ lines, one for each sample, and $$d$$ columns, one for each hidden layer node. In other words, just like with $$x$$ the $$z$$ column vector we had before becomes a line in this new $$Z$$ matrix.

$Z_{s \times d} = \begin{pmatrix} x_{1, 1} & x_{1, 2} & \cdots & x_{1, n} \\ x_{2, 1} & x_{2, 2} & \cdots & x_{2, n} \\ \vdots & \vdots & \vdots & \vdots \\ x_{s, 1} & x_{s, 2} & \cdots & x_{s, n} \end{pmatrix} \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1, d} \\ w_{2,1} & w_{2,2} & \cdots & w_{2, d} \\ \vdots & \vdots & \vdots & \vdots \\ w_{n,1} & w_{n,2} & \cdots & w_{n, d} \end{pmatrix}$

$Z = \begin{pmatrix} z_{1, 1} & z_{1, 2} & \cdots & z_{1, d} \\ z_{2, 1} & z_{2, 2} & \cdots & z_{2, d} \\ \vdots & \vdots & \vdots & \vdots \\ z_{s, 1} & z_{s, 2} & \cdots & z_{s, d} \end{pmatrix}$

In Python:

import numpy as np

# s samples, having d features, your data
# random data here for demonstration purposes
X = np.random.randn(s, n)

# in practice the W and b matrices are in fact randomly initialized
W = np.random.randn(n, d)
b = np.random.randn(d)

Z = X.dot(W) + b


Quick note on the bias term:

Note how X.dot(W) produces an $$s \times d$$ matrix yet our bias value $$b$$ is just a d-dimensional vector. Mathematically, that addition X.dot(W) + b does not work. What we actually want to do is think of $$b$$ as a line vector and add this vector to each line in the X.dot(W) matrix. Remember that X.dot(W) has d columns, so each line is just another d-dimensional line vector which we can add with our d-dimensional b vector.

Luckily Numpy does this broadcasting automatically and the computation works in code. In math we would have to create a matrix B which would have s lines and each line would be the d-dimensional vector b. Then we would add B (uppercase) rather than b (lowercase) to the X.dot(W) matrix.

As a quick recap, we started with a single observed sample, the column vector $$x$$, did some computations with that, then moved on to the multi sample computation, the now matrix $$X$$ and generated the matrix $$Z$$. Are we done? Not quite.

The power of neural networks comes from their ability to model complex non-linear relationships between data points. What we did up to this point is nothing but a linear computation. This by itself cannot model the non-linear dependencies we are interested in.

This is where activation functions come into play.

The sigmoid activation

The sigmoid function is defined as:

$\sigma(x) = \frac{1}{1 + e^{-x}}$

This function produces values between 0 and 1 and has an s-shaped plot. It looks like this:

You can define the sigmoid function in Python like so:

import numpy as np

def sigmoid(x):
return 1 / (1 + np.exp(-x))


and to produce the plot above:

import matplotlib.pyplot as plt

x = np.linspace(-6, 6, 100)
plt.plot(x, sigmoid(x))
plt.show()


Numpy will do the computations elementwise so you can pass the sigmoid function a vector, a matrix, or even just a number (a scalar), and it will work fine.

This activation function will be applied to the $$Z$$ matrix we defined above to create the "activation" values of each neuron.

Thus we have finally arrived at the end of the first step in the forward pass, the input to hidden computation. This will be expressed as follows (again keep in mind the quick note above regarding the bias term):

$A = \sigma(Z) = \sigma(XW + b)$

and in Python:

import numpy as np

def sigmoid(x):
return 1 / (1 + np.exp(-x))

X = np.random.randn(s, n)

W = np.random.randn(n, d)
b = np.random.randn(d)

Z = X.dot(W) + b
A = sigmoid(Z)


The $$A$$ matrix will hold the activation values for each node in the hidden layer across all samples. This is in fact the computed hidden layer. These computed values will move forward in the network.

Note the sigmoid activation function is just one popular function used in neural networks. Other popular functions are the hyperbolic tangent function, tanh, and the ReLU function (used a lot in convolutional neural networks), but there can be others. These functions are used a lot because they have some nice mathematical properties (easy to take the derivative of) and because they produce good empirical results.

Hidden -> Output

Now that we have computed the hidden layer it's time to move forward in the network and arrive at the final layer, the output layer.

The computations done at this step are very much similar to the ones done in the previous step. There are only two differences. The first is that now instead of the input data $$X$$ we are working with the hidden layer values $$A$$ and the second is that the activation function is a bit different (and not like any of the others we have mentioned in passing either).

What we are trying to compute

Our hope is that once it is trained, our neural network will be able to make accurate predictions. Given some observed data we would like to predict to which class the item described by that data belongs. In our houses example we might want to place houses in categories such as higher class, middle class, or lower class. If we are working with images of people we might want to guess to which person the image belongs. These are all classifications problems which neural networks are quite good at solving.

A one-hot encoding (or vector) is what we use to numerically represent a class. A one-hot vector is a vector which has a single element set to one and the remaining elements set to zero. An example of one such vector:

$y = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}$

If we have three classes to which our data belongs we can represent these classes with the following matrix:

$Y = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$

Each line in this matrix represents a distinct class. Whenever we observe data which belongs to the first class we simply add another [1, 0, 0] line to our observed values. If we observe data belonging to the third class we add another [0, 0, 1] line to our observed values and so on.

In the end we would like our neural network to produce vectors which are as close as possible to these target one-hot vectors. So if we have some features, and we know they describe an item belonging to the first class, we would like our neural network to produce an output layer having values looking something like:

$y = \begin{pmatrix} 0.98 \\ 0.01 \\ 0.01 \end{pmatrix}$

This vector is very close to the targeted [1, 0, 0] vector of interest. To obtain values looking like that we will use the softmax activation function.

Softmax activation

To compute the values associated with the output nodes we will first use the same linear combination we used in the previous step. Given the hidden layer activation values $$A$$, a weights matrix $$W^{(2)}$$ describing the connections between the hidden layer and the output layer, and a bias vector $$b^{(2)}$$ we can compute a new set of $$Z^{(2)}$$ values like so:

$Z^{(2)} = AW^{(2)} + b^{(2)}$

In code:

# A computed in the previous step is of shape s x d

# W2 is of shape d x c
# i.e. d hidden nodes and c output classes
W2 = np.random.randn(d, c)
b2 = np.random.randn(c)

Z2 = A.dot(W2) + b2


Given a $$K$$ dimensional vector $$x$$, the softmax activation function is defined as:

$softmax(x_j) = \frac{e^{x_j}}{\sum_k^K e^{x_k}}$

for $$j = 1, \cdots, K$$

In other words, this function takes a vector, and squashes (or normalizes) each number inside the vector to a value between 0 and 1. If you ignore the exponentiation for a bit all this function does is divide each vector element by the sum of all vector elements. It does this however by taking the exponent of each vector element. This produces a new vector of elements between 0 and 1 that have the property of summing up to 1.

In essence, this softmax function produces a probability distribution. Given a set of numbers it will assign higher probabilities to the higher numbers and lower probabilities to the lower numbers. A numerical example:

$softmax( \begin{pmatrix} 5 \\ 1 \\ 1 \end{pmatrix} ) = \begin{pmatrix} 0.96466316 \\ 0.01766842 \\ 0.01766842 \end{pmatrix}$

$0.96466316 + 0.01766842 + 0.01766842 \approx 1$

In Python you can define this as follows:

import numpy as np

def softmax(A):
expA = np.exp(A)
return expA / expA.sum(axis=1, keepdims=True)


Note how the softmax result looks a lot like the result we said we would like the neural network to produce. The idea is that if the neural network produces values which are good in magnitude (i.e. larger numbers for correct classes and smaller numbers for incorrect ones) the softmax function will squash those values to something looking a lot like the one-hot vector we want to predict.

The final neural network output computation in code:

import numpy as np

def sigmoid(x):
return 1 / (1 + np.exp(-x))

def softmax(A):
expA = np.exp(A)
return expA / expA.sum(axis=1, keepdims=True)

# a 3 x 4 x 3 network example
s = 5 # five samples
n = 3 # three features per sample
d = 4 # four nodes in the hidden layer
c = 3 # three classes to predict

X = np.random.randn(s, n)

W = np.random.randn(n, d)
b = np.random.randn(d)

Z = X.dot(W) + b
A = sigmoid(Z)

W2 = np.random.randn(d, c)
b2 = np.random.randn(c)

Z2 = A.dot(W2) + b2
Y = softmax(Z2)


The backward pass (training)

After we compute the first forward pass of our neural network the results are gonna be quite bad. Random weights will produce random results. However, before we figure out how to improve our network we need to figure out how wrong we are in our computations. We do this via the loss (or error) function. This function quantifies how bad our current results are.

The negative log likelihood

If we have $$s$$ training samples and $$c$$ classes then the loss for our prediction $$y$$ with respect to the true labels $$t$$ is given by:

$L(t, y) = - \frac{1}{s}\sum_i^s \sum_j^c t_{i,j} log y_{i,j}$

While the function may seem somewhat complicated it is actually not doing very much. It is taking an average across samples of the product between the log of our predicted values and the target values. If that seems a bit like a mouthful let's read it from right to left.

The $$t \times log (y)$$ part multiplies our predicted values by the target values. If you ignore the log part for a bit what this does is the element-wise multiplication of two vectors which, if you remember from the previous sections, should look something like t = [1, 0, 0] and, hopefully, y = [0.98, 0.01, 0.01].

If our prediction is very close to 1 then log of that number will be very close to zero which means our error for that particular case will be very close to zero. If we maintain this performance across samples then the average error is also going to be very close to zero.

The minus sign at the begining is there because log of a number between zero and one is negative and we would like to work with positive values so as to minimize the function (i.e go from values above zero to values as close as possible to zero).

Finally we reach one of the more interesting parts of a neural network: the learning process. At this point we have one forward pass done, and we can compute how bad our neural network is using the negative log likelihood function. It's time to change our parameters so that on the next forward pass the neural network does better.

The backpropagation step involves the propagation of the neural network's error back through the network. Based on this error the neural network's weights can be updated so that they become better at minimizing the error.

This is the more math heavy part of a neural network. I will cover the math to some degree but I will add many links from across the web which cover this in more detail.

Gradient descent is an algorithm for iteratively finding the minimum of a function. Starting from a random point the algorithm takes small steps in the opposite direction of the gradient of the function at that point until it eventually reaches the minimum value of that function. This may seem a bit complicated but it's not that bad.

In math notation, if we are given a continuous function $$f(x)$$ and an initial random $$x_0$$ value, then we can find a "better" $$x_1$$ value (better in that the value of $$f$$ at $$x_1$$ is smaller, i.e. we are getting closer to the minimum) by taking a new $$x_1$$ as follows:

$x_{n} = x_{n-1} - \alpha \frac{\partial f}{\partial x_{n-1}}$

where $$\alpha$$ is called the learning rate and usually takes values between $$10^{-1}$$ and $$10^{-5}$$.

Note that the algorithm is in no way guaranteed to find the global minimum of the function (in fact it is highly unlikely that it will do so if the function has multiple local minima), nor is it guaranteed to even find a solution at all. While the expectation is that the solution will improve at every step it is possible for the solutions to diverge and for the function to take ever increasing values (for example, this often happens when the learning rate is set too high).

Let's take a quick look at a simple numerical example to see this computation in practice. Let $$f$$ be the function

$f(x, y) = x^2 + y^2$

then

$\frac{\partial f}{\partial x} = 2x$

and

$\frac{\partial f}{\partial y} = 2y$

This function has a global minimum at $$f(0,0) = 0$$. In code, we can define this function, along with the gradient descent computations, as follows:

import numpy as np

f = lambda x: (x**2).sum()

np.random.seed(0)

# random 2 dimensional vector
# with seed 0 it takes the values [0.54, 0.71]
x = np.random.rand(2, 1)
# learning rate
alpha = 0.01

for i in range(1000):
x = x - alpha * f_grad(x)

print(x) # aproximately (0, 0)
print(f(x)) # around 2.3e-18 so aproximately 0


Applying gradient descent to our neural network is somewhat more involved in terms of the calculus required but the basic principles are the same. We have a loss function defined and the parameters of this function are the weights and biases of our neural network. So we need to find the weights of our neural network such that the value of our loss function is minimized. Just like we did in the simple example above what we have to do now is take the gradient of our loss function with respect to our parameters.

There are four derivatives we need to compute: $$\frac{\partial L}{\partial W^{(2)}}, \frac{\partial L}{\partial b^{(2)}}, \frac{\partial L}{\partial W}, \frac{\partial L}{\partial b}$$. These have the following formulae:

$\frac{\partial L}{\partial W^{(2)}} = \frac{\partial L}{\partial Z^{(2)}} \frac{\partial Z^{(2)}}{\partial W^{(2)}}$

This is the derivative of our loss function with respect to our second set of weights, the ones corresponding to the hidden to output connections. On the right hand side we see it expressed via the chain rule. We will denote the first derivative in the chain as $$\delta^{(2)} = \frac{\partial L}{\partial Z^{(2)}}$$ and leave it at that for now.

The second derivative is much easier to compute. Remember that $$Z^{(2)} = AW^{(2)} + b^{(2)}$$. Taking the derivative of $$Z^{(2)}$$ with respect to $$W^{(2)}$$ leaves us with $$A$$.

So what about that $$\delta^{(2)}$$? You can think of it as the "error" of the nodes in its layer (our output layer in this case). Denoting that derivative as $$\delta^{(2)}$$ is also useful because it allows us to express derivatives across layers in a recursive fashion. We will this in action later. Andrew Ng's famous coursera course is a great resource on how this works. I will link further resources regarding these derivations at the end of this section but for now let's just see the math formulas we have to implement in code.

$\delta^{(2)} = Y - T$ $\frac{\partial L}{\partial W^{(2)}} = A^{T} \delta^{(2)}$

While the notation is a bit different you can see how $$\delta^{(2)}$$ is computed in this great math.stackexchange answer. The only difference is we are using matrix operations to express all the computations at once.

At this same layer we also have:

$\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}$

Again, remember that $$Z^{(2)} = AW^{(2)} + b^{(2)}$$. Taking the derivative of $$Z^{(2)}$$ with respect to $$b^{(2)}$$ leaves us with $$1$$. So then:

$\frac{\partial L}{\partial b^{(2)}} = \frac{\partial L}{\partial Z^{(2)}} \frac{\partial Z^{(2)}}{\partial b^{(2)}} = \delta^{(2)}$

Moving on to the first set of weights, we have ($$\odot$$ denoting element wise multiplication):

$\delta^{(1)} = \delta^{(2)} W^{(2)T} \odot \sigma^{'}(Z)$ $\delta^{(1)} = \delta^{(2)} W^{(2)T} \odot \sigma(Z) \odot (1 - \sigma(Z))$ $\delta^{(1)} = \delta^{(2)} W^{(2)T} \odot A \odot (1 - A)$

and

$\frac{\partial L}{\partial W} = X^{T} \delta^{(1)}$ $\frac{\partial L}{\partial b} = \delta^{(1)}$

Finally, to put it all in one place, what we are working with is the following:

$\delta^{(2)} = Y - T$ $\delta^{(1)} = \delta^{(2)} W^{(2)T} \odot A \odot (1 - A)$ $\frac{\partial L}{\partial W^{(2)}} = A^{T} \delta^{(2)}$ $\frac{\partial L}{\partial b^{(2)}} = \delta^{(2)}$ $\frac{\partial L}{\partial W} = X^{T} \delta^{(1)}$ $\frac{\partial L}{\partial b} = \delta^{(1)}$

Before moving on with this article here are a number of links which cover these derivations in some greater detail:

• Ng's Machine Learning Course link
• Backpropagation with softmax cross entropy link
• Derivative of softmax loss function link
• Implementing a neural network from scratch link
• Vector, Matrix, and Tensor Derivatives link
• How the backpropagation algorithm works link
• Principles of training multi-layer neural network using backpropagation link

A neural network in practice

We are going to give our neural network data points belonging to three classes. We would like our network to figure out which point belongs to which class. First, let's generate some data we can visualize:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)

# generate three Gaussian clouds each holding 500 points
X1 = np.random.randn(500, 2) + np.array([0, -2])
X2 = np.random.randn(500, 2) + np.array([2, 2])
X3 = np.random.randn(500, 2) + np.array([-2, 2])

# put them all in a big matrix
X = np.vstack([X1, X2, X3])

# generate the one-hot-encodings
labels = np.array([0]*500 + [1]*500 + [2]*500)
T = np.zeros((1500, 3))
for i in range(1500):
T[i, labels[i]] = 1

# visualize the data
plt.scatter(X[:,0], X[:,1], c=labels, s=100, alpha=0.5)
plt.show()


This produces the following chart:

Then, transcribing the formulas above to Python, we get the following neural network:

samples = X.shape[0] # 1500 samples
features = X.shape[1] # 2 features
hidden_nodes = 5
classes = 3

# randomly initialize weights
W1 = np.random.randn(features, hidden_nodes)
b1 = np.random.randn(hidden_nodes)
W2 = np.random.randn(hidden_nodes, classes)
b2 = np.random.randn(classes)

alpha = 10e-6
costs = []
for epoch in range(10000):
# forward pass
A = sigmoid(X.dot(W1) + b1) # A = sigma(Z)
Y = softmax(A.dot(W2) + b2) # Y = softmax(Z2)

# backward pass
delta2 = Y - T
delta1 = (delta2).dot(W2.T) * A * (1 - A)

W2 -= alpha * A.T.dot(delta2)
b2 -= alpha * (delta2).sum(axis=0)

W1 -= alpha * X.T.dot(delta1)
b1 -= alpha * (delta1).sum(axis=0)

# save loss function values across training iterations
if epoch % 100 == 0:
loss = np.sum(-T * np.log(Y))
print('Loss function value: ', loss)
costs.append(loss)


Plotting the costs (loss function) via:

plt.plot(costs)
plt.show()


produces the following plot:

We can see how the neural network gets better and better and how the total loss decreases over time. The classification rate ends up being around 97%. This code is also available as a github gist here.

Some closing words

We've seen a neural network in practice, discussed the mathematics behind it, and implemented one in Python using basic Numpy arrays and operations. This is a good starting point in terms of understanding neural networks. There are however a number of things which we have not discussed.

First of all, a neural network is never trained on all of the data that you have. The data is usually split into training and testing sets. You train on the training set and then test the network on the test set. This has the network make predictions on data it has never seen. A good neural network is only as good as its performance on the test set.

Second of all, neural networks, like most models, are prone to overfitting. This is when your model "memorizes" the data rather than learn from it. An overfit model does not generalize well and will perform poorly on new unseen data (or at least not as well as it should). Some examples of methods which are usually used to prevent overfitting include regularization and dropout. Regularization includes adding part of the weights back to the new weights after the gradient descent update. Dropout drops some nodes out of the training update across epochs. This is particularly useful in large deep networks.

Early stopping is another method which is used to prevent overfitting and it's quite easy to implement. All you have to do is prepare a separate data set, called a validation set, and then run your training as usual on the training set. As you train your network however, you also test the network once every n number of steps on the validation set. When the performance on the validation set stops improving you stop training.

Third of all, a neural network is only going to be as good as the data it is trained on. Large deep neural networks require large amounts of data in order to be able to learn anything. Input data cannot be simply collected and passed to a neural network. Most data will require cleansing and normalization. Normalization is where you shift your data from a range like [-328094, 82932] to something like [-1, 1]. Remember how, for example, the sigmoid activation returns values between 0 and 1 but changes most only for input values between -6 and 6. If you pass this activation function large numbers it will only produce values of 1. A neuron will not be able to learn much from that.

Fourth, and finally, the gradient descent algorithm described here is only the most basic among them. What we have worked with is called batch gradient descent. At every update step we have looked at all of our training examples and updated all of our weights. When you have millions or billions of training samples these computations become quite intensive. Smarter variations include stochastic gradient descent or mini batch gradient descent. With these variations you only look at a small part of your training examples at each step (or even just a single example). Furthermore, the update equation itself is still only in its most basic form. Modern variations include Nesterov momentum, or the Adam algorithm. You can see a brief discussion of these methods here.

Sep 21, 2016