Machine Learning for Beginners: An Introduction to Neural Networks
A simple explanation of how they work and how to implement one from scratch in Python.
Here’s something that might surprise you: neural networks aren’t that complicated! The term “neural network” gets used as a buzzword a lot, but in reality they’re often much simpler than people imagine.
This post is intended for complete beginners and assumes ZERO prior knowledge of machine learning. We’ll understand how neural networks work while implementing one from scratch in Python.
Let’s get started!
1. Building Blocks: Neurons
First, we have to talk about neurons, the basic unit of a neural network. A neuron takes inputs, does some math with them, and produces one output. Here’s what a 2-input neuron looks like:
3 things are happening here. First, each input is multiplied by a weight:
x1 → x1 * w1
x2 → x2 * w2

Next, all the weighted inputs are added together with a bias b:

(x1 * w1) + (x2 * w2) + b
Finally, the sum is passed through an activation function:

y = f((x1 * w1) + (x2 * w2) + b)

The activation function is used to turn an unbounded input into an output that has a nice, predictable form. A commonly used activation function is the sigmoid function:

f(x) = 1 / (1 + e^(-x))

The sigmoid function only outputs numbers in the range (0, 1). You can think of it as compressing (−∞, +∞) to (0, 1) - big negative numbers become ~0, and big positive numbers become ~1.
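To make that "squashing" concrete, here's a quick check (a small example I'm adding, not from the original post) of what sigmoid does to a large negative and a large positive input:

import numpy as np

def sigmoid(x):
  # f(x) = 1 / (1 + e^(-x))
  return 1 / (1 + np.exp(-x))

print(sigmoid(-10))  # ~0.000045 (big negative input -> close to 0)
print(sigmoid(0))    # 0.5 (the midpoint)
print(sigmoid(10))   # ~0.999955 (big positive input -> close to 1)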
A Simple Example
Assume we have a 2-input neuron that uses the sigmoid activation function and has the following parameters:
w = [0, 1]
b = 4

w = [0, 1] is just a way of writing w1 = 0, w2 = 1 in vector form. Now, let's give the neuron an input of x = [2, 3]. We'll use the dot product to write things more concisely:

(w · x) + b = (w1 * x1) + (w2 * x2) + b = (0 * 2) + (1 * 3) + 4 = 7
y = f(w · x + b) = f(7) = 0.999

The neuron outputs 0.999 given the inputs x = [2, 3]. That's it! This process of passing inputs forward to get an output is known as feedforward.
Coding a Neuron
Time to implement a neuron! We’ll use NumPy, a popular and powerful computing library for Python, to help us do math:
import numpy as np

def sigmoid(x):
  # Our activation function: f(x) = 1 / (1 + e^(-x))
  return 1 / (1 + np.exp(-x))

class Neuron:
  def __init__(self, weights, bias):
    self.weights = weights
    self.bias = bias

  def feedforward(self, inputs):
    # Weight inputs, add bias, then use the activation function
    total = np.dot(self.weights, inputs) + self.bias
    return sigmoid(total)

weights = np.array([0, 1])  # w1 = 0, w2 = 1
bias = 4                    # b = 4
n = Neuron(weights, bias)

x = np.array([2, 3])        # x1 = 2, x2 = 3
print(n.feedforward(x))     # 0.9990889488055994
Recognize those numbers? That’s the example we just did! We get the same answer of 0.999.
2. Combining Neurons into a Neural Network
A neural network is nothing more than a bunch of neurons connected together. Here’s what a simple neural network might look like:
This network has 2 inputs, a hidden layer with 2 neurons (h1 and h2), and an output layer with 1 neuron (o1). Notice that the inputs for o1 are the outputs from h1 and h2 - that’s what makes this a network.
A hidden layer is any layer between the input (first) layer and output (last) layer. There can be multiple hidden layers!
An Example: Feedforward
Let’s use the network pictured above and assume all neurons have the same weights w = [0, 1], the same bias b = 0, and the same sigmoid activation function. Let h1, h2, o1 denote the outputs of the neurons they represent.

What happens if we pass in the input x = [2, 3]?

h1 = h2 = f(w · x + b) = f((0 * 2) + (1 * 3) + 0) = f(3) = 0.9526
o1 = f(w · [h1, h2] + b) = f((0 * h1) + (1 * h2) + 0) = f(0.9526) = 0.7216

The output of the neural network for input x = [2, 3] is 0.7216. Pretty simple, right?
A neural network can have any number of layers with any number of neurons in those layers. The basic idea stays the same: feed the input(s) forward through the neurons in the network to get the output(s) at the end. For simplicity, we’ll keep using the network pictured above for the rest of this post.
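Since every neuron in a layer just takes a dot product, an entire layer can also be computed at once with a weight matrix and a bias vector. Here's a minimal sketch of that idea (a generalization I'm adding for illustration - the layer_feedforward helper is hypothetical and not part of the original code):

import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

def layer_feedforward(W, b, x):
  # W: (num_neurons x num_inputs), b: (num_neurons,), x: (num_inputs,)
  # Each row of W holds one neuron's weights.
  return sigmoid(np.dot(W, x) + b)

W = np.array([[0, 1],
              [0, 1]])   # both hidden neurons use w = [0, 1]
b = np.array([0, 0])     # both biases are 0
print(layer_feedforward(W, b, np.array([2, 3])))  # [0.95257413 0.95257413]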
Coding a Neural Network: Feedforward
Let’s implement feedforward for our neural network. Here’s the image of the network again for reference:
import numpy as np

# ... code from previous section here

class OurNeuralNetwork:
  '''
  A neural network with:
    - 2 inputs
    - a hidden layer with 2 neurons (h1, h2)
    - an output layer with 1 neuron (o1)
  Each neuron has the same weights and bias:
    - w = [0, 1]
    - b = 0
  '''
  def __init__(self):
    weights = np.array([0, 1])
    bias = 0

    # The Neuron class here is from the previous section
    self.h1 = Neuron(weights, bias)
    self.h2 = Neuron(weights, bias)
    self.o1 = Neuron(weights, bias)

  def feedforward(self, x):
    out_h1 = self.h1.feedforward(x)
    out_h2 = self.h2.feedforward(x)

    # The inputs for o1 are the outputs from h1 and h2
    out_o1 = self.o1.feedforward(np.array([out_h1, out_h2]))

    return out_o1

network = OurNeuralNetwork()
x = np.array([2, 3])
print(network.feedforward(x))  # 0.7216325609518421
We got 0.7216 again! Looks like it works.
3. Training a Neural Network, Part 1
Say we have the following measurements:
Name    | Weight (lb) | Height (in) | Gender
Alice   | 133         | 65          | F
Bob     | 160         | 72          | M
Charlie | 152         | 70          | M
Diana   | 120         | 60          | F
Let’s train our network to predict someone’s gender given their weight and height:
We’ll represent Male with a 0 and Female with a 1, and we’ll also shift the data to make it easier to use:
Name    | Weight (minus 135) | Height (minus 66) | Gender
Alice   | -2                 | -1                | 1
Bob     | 25                 | 6                 | 0
Charlie | 17                 | 4                 | 0
Diana   | -15                | -6                | 1
I arbitrarily chose the shift amounts (135 and 66) to make the numbers look nice. Normally, you’d shift by the mean.
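For reference, here's a minimal sketch (not part of the original code) of what shifting by the mean would look like with NumPy:

import numpy as np

# Raw measurements: [weight (lb), height (in)] for Alice, Bob, Charlie, Diana
raw = np.array([
  [133, 65],
  [160, 72],
  [152, 70],
  [120, 60],
])

# Subtract each column's mean so the features are centered around 0
shifted = raw - raw.mean(axis=0)
print(raw.mean(axis=0))  # [141.25  66.75]
print(shifted)           # e.g. Alice becomes [-8.25, -1.75]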
Loss
Before we train our network, we first need a way to quantify how “good” it’s doing so that it can try to do “better”. That’s what the loss is.
We’ll use the mean squared error (MSE) loss:
MSE = (1/n) * Σ (y_true - y_pred)^2

Here, n is the number of samples, y_true is the true value of the variable (the "correct answer"), y_pred is the value the network predicts, and the sum runs over all n samples. Better predictions mean lower loss, so training a network really means trying to minimize its loss.
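As a quick worked example, suppose the network predicted 0 ("Male") for all four people in our dataset. Here's that calculation in code (using the same mse_loss helper that appears in the full program later):

import numpy as np

def mse_loss(y_true, y_pred):
  # y_true and y_pred are numpy arrays of the same length.
  return ((y_true - y_pred) ** 2).mean()

y_true = np.array([1, 0, 0, 1])  # Alice, Bob, Charlie, Diana
y_pred = np.array([0, 0, 0, 0])  # the network predicted 0 ("Male") for everyone

print(mse_loss(y_true, y_pred))  # 0.5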
4. Training a Neural Network, Part 2

We now have a clear goal: minimize the loss of the network by adjusting its weights and biases. For simplicity, let's pretend our dataset only contains Alice. Then the MSE loss is just Alice's squared error (her y_true is 1):

L = (1 - y_pred)^2

Let's also label every weight and bias in the network: h1 uses weights w1, w2 and bias b1; h2 uses weights w3, w4 and bias b2; and o1 uses weights w5, w6 and bias b3. Loss is then a function of all of these parameters. If we tweak w1, how does L change? That question is answered by the partial derivative ∂L/∂w1, which we can split up using the chain rule:

∂L/∂w1 = ∂L/∂y_pred * ∂y_pred/∂w1

We can calculate ∂L/∂y_pred because we computed L = (1 - y_pred)^2 above:

∂L/∂y_pred = ∂(1 - y_pred)^2 / ∂y_pred = -2(1 - y_pred)
Now, let’s figure out what to do with ∂y_pred/∂w1. Just like before, let h1, h2, o1 be the outputs of the neurons they represent. Then

y_pred = o1 = f(w5 * h1 + w6 * h2 + b3)

Since w1 only affects h1 (not h2), we can write

∂y_pred/∂w1 = ∂y_pred/∂h1 * ∂h1/∂w1
∂y_pred/∂h1 = w5 * f'(w5 * h1 + w6 * h2 + b3)

We do the same thing for ∂h1/∂w1:

h1 = f(w1 * x1 + w2 * x2 + b1)
∂h1/∂w1 = x1 * f'(w1 * x1 + w2 * x2 + b1)
This is the second time we’ve seen f'(x) (the derivative of the sigmoid function) now! Let’s derive it:
f(x) = 1 / (1 + e^(-x))
f'(x) = e^(-x) / (1 + e^(-x))^2 = f(x) * (1 - f(x))

We’ll use this nice form for f'(x) later.
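If you want to convince yourself of that identity numerically, here's a quick check (an extra example, not from the original post) comparing f(x) * (1 - f(x)) against a finite-difference estimate of the slope:

import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

x = -3.0
eps = 1e-6

analytic = sigmoid(x) * (1 - sigmoid(x))                    # the nice closed form
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps) # numerical slope

print(analytic)  # ~0.045177
print(numeric)   # ~0.045177 (matches to many decimal places)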
We’re done! We’ve managed to break down ∂L/∂w1 into several parts we can calculate:

∂L/∂w1 = ∂L/∂y_pred * ∂y_pred/∂h1 * ∂h1/∂w1
This system of calculating partial derivatives by working backwards is known as backpropagation, or “backprop”.
Phew. That was a lot of symbols - it’s alright if you’re still a bit confused. Let’s do an example to see this in action!
Example: Calculating the Partial Derivative
We’re going to continue pretending only Alice is in our dataset:
Name  | Weight (minus 135) | Height (minus 66) | Gender
Alice | -2                 | -1                | 1

Let’s initialize all the weights to 1 and all the biases to 0. If we do a feedforward pass through the network, we get:

h1 = f(w1 * x1 + w2 * x2 + b1) = f(-2 + -1 + 0) = f(-3) = 0.0474
h2 = f(w3 * x1 + w4 * x2 + b2) = f(-3) = 0.0474
o1 = f(w5 * h1 + w6 * h2 + b3) = f(0.0474 + 0.0474 + 0) = f(0.0948) = 0.524

The network outputs y_pred = 0.524, which doesn’t strongly favor Male (0) or Female (1). Let’s calculate ∂L/∂w1:
∂L/∂w1 = ∂L/∂y_pred * ∂y_pred/∂h1 * ∂h1/∂w1

∂L/∂y_pred = -2(1 - y_pred) = -2(1 - 0.524) = -0.952

∂y_pred/∂h1 = w5 * f'(w5 * h1 + w6 * h2 + b3) = 1 * f'(0.0474 + 0.0474 + 0) = f(0.0948) * (1 - f(0.0948)) = 0.249

∂h1/∂w1 = x1 * f'(w1 * x1 + w2 * x2 + b1) = -2 * f'(-2 + -1 + 0) = -2 * f(-3) * (1 - f(-3)) = -0.0904

Putting it all together:

∂L/∂w1 = -0.952 * 0.249 * -0.0904 = 0.0214
Reminder: we derived f'(x) = f(x) * (1 - f(x)) for our sigmoid activation function earlier.

We did it! This tells us that if we were to increase w1, L would increase a tiiiny bit as a result.
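As an optional sanity check (an extra example, not from the original post), you can approximate ∂L/∂w1 with a finite difference and see that it comes out to roughly 0.0214:

import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

def loss_given_w1(w1):
  # All other weights are 1, all biases are 0; input is Alice: x = [-2, -1], y_true = 1
  x1, x2, y_true = -2, -1, 1
  h1 = sigmoid(w1 * x1 + 1 * x2 + 0)
  h2 = sigmoid(1 * x1 + 1 * x2 + 0)
  o1 = sigmoid(1 * h1 + 1 * h2 + 0)
  return (y_true - o1) ** 2

eps = 1e-6
print((loss_given_w1(1 + eps) - loss_given_w1(1 - eps)) / (2 * eps))  # ~0.0214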
Training: Stochastic Gradient Descent
We have all the tools we need to train a neural network now! We’ll use an optimization algorithm called stochastic gradient descent (SGD) that tells us how to change our weights and biases to minimize loss. It’s basically just this update equation:
w1 ← w1 - η * ∂L/∂w1
η is a constant called the learning rate that controls how fast we train. All we’re doing is subtracting η * ∂L/∂w1 from w1:

- If ∂L/∂w1 is positive, w1 will decrease, which makes L decrease.
- If ∂L/∂w1 is negative, w1 will increase, which makes L decrease.
If we do this for every weight and bias in the network, the loss will slowly decrease and our network will improve.
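To make the update rule concrete, here's a tiny sketch (using the numbers from the example above, not code from the original post) of a single SGD step for w1:

learn_rate = 0.1    # this is the learning rate eta
w1 = 1.0            # current value of w1
d_L_d_w1 = 0.0214   # the partial derivative we just calculated

w1 = w1 - learn_rate * d_L_d_w1
print(w1)  # 0.99786 - w1 decreased slightly, which should make L decrease slightly too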
Our training process will look like this:

1. Choose one sample from our dataset. This is what makes it stochastic gradient descent - we only operate on one sample at a time.
2. Calculate all the partial derivatives of loss with respect to weights or biases (e.g. ∂L/∂w1, ∂L/∂w2, etc.).
3. Use the update equation to update each weight and bias.
4. Go back to step 1.
Code: A Complete Neural Network
It’s finally time to implement a complete neural network:
Name    | Weight (minus 135) | Height (minus 66) | Gender
Alice   | -2                 | -1                | 1
Bob     | 25                 | 6                 | 0
Charlie | 17                 | 4                 | 0
Diana   | -15                | -6                | 1
import numpy as np

def sigmoid(x):
  # Sigmoid activation function: f(x) = 1 / (1 + e^(-x))
  return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
  # Derivative of sigmoid: f'(x) = f(x) * (1 - f(x))
  fx = sigmoid(x)
  return fx * (1 - fx)

def mse_loss(y_true, y_pred):
  # y_true and y_pred are numpy arrays of the same length.
  return ((y_true - y_pred) ** 2).mean()

class OurNeuralNetwork:
  '''
  A neural network with:
    - 2 inputs
    - a hidden layer with 2 neurons (h1, h2)
    - an output layer with 1 neuron (o1)

  *** DISCLAIMER ***:
  The code below is intended to be simple and educational, NOT optimal.
  Real neural net code looks nothing like this. DO NOT use this code.
  Instead, read/run it to understand how this specific network works.
  '''
  def __init__(self):
    # Weights
    self.w1 = np.random.normal()
    self.w2 = np.random.normal()
    self.w3 = np.random.normal()
    self.w4 = np.random.normal()
    self.w5 = np.random.normal()
    self.w6 = np.random.normal()

    # Biases
    self.b1 = np.random.normal()
    self.b2 = np.random.normal()
    self.b3 = np.random.normal()

  def feedforward(self, x):
    # x is a numpy array with 2 elements.
    h1 = sigmoid(self.w1 * x[0] + self.w2 * x[1] + self.b1)
    h2 = sigmoid(self.w3 * x[0] + self.w4 * x[1] + self.b2)
    o1 = sigmoid(self.w5 * h1 + self.w6 * h2 + self.b3)
    return o1

  def train(self, data, all_y_trues):
    '''
    - data is a (n x 2) numpy array, n = # of samples in the dataset.
    - all_y_trues is a numpy array with n elements.
      Elements in all_y_trues correspond to those in data.
    '''
    learn_rate = 0.1
    epochs = 1000  # number of times to loop through the entire dataset

    for epoch in range(epochs):
      for x, y_true in zip(data, all_y_trues):
        # --- Do a feedforward (we'll need these values later)
        sum_h1 = self.w1 * x[0] + self.w2 * x[1] + self.b1
        h1 = sigmoid(sum_h1)

        sum_h2 = self.w3 * x[0] + self.w4 * x[1] + self.b2
        h2 = sigmoid(sum_h2)

        sum_o1 = self.w5 * h1 + self.w6 * h2 + self.b3
        o1 = sigmoid(sum_o1)
        y_pred = o1

        # --- Calculate partial derivatives.
        # --- Naming: d_L_d_w1 represents "partial L / partial w1"
        d_L_d_ypred = -2 * (y_true - y_pred)

        # Neuron o1
        d_ypred_d_w5 = h1 * deriv_sigmoid(sum_o1)
        d_ypred_d_w6 = h2 * deriv_sigmoid(sum_o1)
        d_ypred_d_b3 = deriv_sigmoid(sum_o1)

        d_ypred_d_h1 = self.w5 * deriv_sigmoid(sum_o1)
        d_ypred_d_h2 = self.w6 * deriv_sigmoid(sum_o1)

        # Neuron h1
        d_h1_d_w1 = x[0] * deriv_sigmoid(sum_h1)
        d_h1_d_w2 = x[1] * deriv_sigmoid(sum_h1)
        d_h1_d_b1 = deriv_sigmoid(sum_h1)

        # Neuron h2
        d_h2_d_w3 = x[0] * deriv_sigmoid(sum_h2)
        d_h2_d_w4 = x[1] * deriv_sigmoid(sum_h2)
        d_h2_d_b2 = deriv_sigmoid(sum_h2)

        # --- Update weights and biases
        # Neuron h1
        self.w1 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1
        self.w2 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w2
        self.b1 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_b1

        # Neuron h2
        self.w3 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w3
        self.w4 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w4
        self.b2 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_b2

        # Neuron o1
        self.w5 -= learn_rate * d_L_d_ypred * d_ypred_d_w5
        self.w6 -= learn_rate * d_L_d_ypred * d_ypred_d_w6
        self.b3 -= learn_rate * d_L_d_ypred * d_ypred_d_b3

      # --- Calculate total loss at the end of each epoch
      if epoch % 10 == 0:
        y_preds = np.apply_along_axis(self.feedforward, 1, data)
        loss = mse_loss(all_y_trues, y_preds)
        print("Epoch %d loss: %.3f" % (epoch, loss))

# Define dataset
data = np.array([
  [-2, -1],   # Alice
  [25, 6],    # Bob
  [17, 4],    # Charlie
  [-15, -6],  # Diana
])
all_y_trues = np.array([
  1,  # Alice
  0,  # Bob
  0,  # Charlie
  1,  # Diana
])

# Train our neural network!
network = OurNeuralNetwork()
network.train(data, all_y_trues)
Our loss steadily decreases as the network learns:
We can now use the network to predict genders:
# Make some predictions
emily = np.array([-7, -3]) # 128 pounds, 63 inches
frank = np.array([20, 2]) # 155 pounds, 68 inches
print("Emily: %.3f" % network.feedforward(emily)) # 0.951 - F
print("Frank: %.3f" % network.feedforward(frank)) # 0.039 - M
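The network outputs a number between 0 and 1, so one simple way to turn a prediction into a label (a convention I'm adding here; the original stops at the raw numbers) is to threshold it at 0.5:

def to_gender(pred):
  # Outputs closer to 1 mean Female (1), closer to 0 mean Male (0)
  return "F" if pred >= 0.5 else "M"

print(to_gender(0.951))  # F (Emily)
print(to_gender(0.039))  # M (Frank)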