Machine Learning for Beginners: An Introduction to Neural Networks
A simple explanation of how they work and how to implement one from scratch in Python.
Here’s something that might surprise you: neural networks aren’t that complicated! The term “neural network” gets used as a buzzword a lot, but in reality they’re often much simpler than people imagine.
This post is intended for complete beginners and assumes ZERO prior knowledge of machine learning. We’ll understand how neural networks work while implementing one from scratch in Python.
Let’s get started!
1. Building Blocks: Neurons
First, we have to talk about neurons, the basic unit of a neural network. A neuron takes inputs, does some math with them, and produces one output. Here’s what a 2-input neuron looks like:
3 things are happening here. First, each input is multiplied by a weight:
x1 → x1 * w1
x2 → x2 * w2

Next, all the weighted inputs are added together with a bias b:

(x1 * w1) + (x2 * w2) + b
Finally, the sum is passed through an activation function:

y = f((x1 * w1) + (x2 * w2) + b)

The activation function is used to turn an unbounded input into an output that has a nice, predictable form. A commonly used activation function is the sigmoid function:

f(x) = 1 / (1 + e^(-x))

The sigmoid function only outputs numbers in the range (0, 1). You can think of it as compressing (−∞, +∞) to (0, 1) - big negative numbers become ~0, and big positive numbers become ~1.
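To make that "squashing" concrete, here's a quick check (a small example I'm adding, not from the original post) of what sigmoid does to a large negative and a large positive input:

import numpy as np

def sigmoid(x):
  # f(x) = 1 / (1 + e^(-x))
  return 1 / (1 + np.exp(-x))

print(sigmoid(-10))  # ~0.000045 (big negative input -> close to 0)
print(sigmoid(0))    # 0.5 (the midpoint)
print(sigmoid(10))   # ~0.999955 (big positive input -> close to 1)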
A Simple Example
Assume we have a 2-input neuron that uses the sigmoid activation function and has the following parameters:
w = [0, 1]
b = 4

w = [0, 1] is just a way of writing w1 = 0, w2 = 1 in vector form. Now, let's give the neuron an input of x = [2, 3]. We'll use the dot product to write things more concisely:

(w · x) + b = (w1 * x1) + (w2 * x2) + b = (0 * 2) + (1 * 3) + 4 = 7
y = f(w · x + b) = f(7) = 0.999

The neuron outputs 0.999 given the inputs x = [2, 3]. That's it! This process of passing inputs forward to get an output is known as feedforward.
Coding a Neuron
Time to implement a neuron! We’ll use NumPy, a popular and powerful computing library for Python, to help us do math:
import numpy as np

def sigmoid(x):
  # Our activation function: f(x) = 1 / (1 + e^(-x))
  return 1 / (1 + np.exp(-x))

class Neuron:
  def __init__(self, weights, bias):
    self.weights = weights
    self.bias = bias

  def feedforward(self, inputs):
    # Weight inputs, add bias, then use the activation function
    total = np.dot(self.weights, inputs) + self.bias
    return sigmoid(total)

weights = np.array([0, 1])  # w1 = 0, w2 = 1
bias = 4                    # b = 4
n = Neuron(weights, bias)

x = np.array([2, 3])        # x1 = 2, x2 = 3
print(n.feedforward(x))     # 0.9990889488055994
Recognize those numbers? That’s the example we just did! We get the same answer of 0.999.
2. Combining Neurons into a Neural Network
A neural network is nothing more than a bunch of neurons connected together. Here’s what a simple neural network might look like:
This network has 2 inputs, a hidden layer with 2 neurons (h1 and h2), and an output layer with 1 neuron (o1). Notice that the inputs for o1 are the outputs from h1 and h2 - that’s what makes this a network.
A hidden layer is any layer between the input (first) layer and output (last) layer. There can be multiple hidden layers!
An Example: Feedforward
Let’s use the network pictured above and assume all neurons have the same weights w = [0, 1], the same bias b = 0, and the same sigmoid activation function. Let h1, h2, o1 denote the outputs of the neurons they represent.

What happens if we pass in the input x = [2, 3]?

h1 = h2 = f(w · x + b) = f((0 * 2) + (1 * 3) + 0) = f(3) = 0.9526
o1 = f(w · [h1, h2] + b) = f((0 * h1) + (1 * h2) + 0) = f(0.9526) = 0.7216

The output of the neural network for input x = [2, 3] is 0.7216. Pretty simple, right?
A neural network can have any number of layers with any number of neurons in those layers. The basic idea stays the same: feed the input(s) forward through the neurons in the network to get the output(s) at the end. For simplicity, we’ll keep using the network pictured above for the rest of this post.
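Since every neuron in a layer just takes a dot product, an entire layer can also be computed at once with a weight matrix and a bias vector. Here's a minimal sketch of that idea (a generalization I'm adding for illustration - the layer_feedforward helper is hypothetical and not part of the original code):

import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

def layer_feedforward(W, b, x):
  # W: (num_neurons x num_inputs), b: (num_neurons,), x: (num_inputs,)
  # Each row of W holds one neuron's weights.
  return sigmoid(np.dot(W, x) + b)

W = np.array([[0, 1],
              [0, 1]])   # both hidden neurons use w = [0, 1]
b = np.array([0, 0])     # both biases are 0
print(layer_feedforward(W, b, np.array([2, 3])))  # [0.95257413 0.95257413]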
Coding a Neural Network: Feedforward
Let’s implement feedforward for our neural network. Here’s the image of the network again for reference:
import numpy as np

# ... code from previous section here

class OurNeuralNetwork:
  '''
  A neural network with:
    - 2 inputs
    - a hidden layer with 2 neurons (h1, h2)
    - an output layer with 1 neuron (o1)
  Each neuron has the same weights and bias:
    - w = [0, 1]
    - b = 0
  '''
  def __init__(self):
    weights = np.array([0, 1])
    bias = 0

    # The Neuron class here is from the previous section
    self.h1 = Neuron(weights, bias)
    self.h2 = Neuron(weights, bias)
    self.o1 = Neuron(weights, bias)

  def feedforward(self, x):
    out_h1 = self.h1.feedforward(x)
    out_h2 = self.h2.feedforward(x)

    # The inputs for o1 are the outputs from h1 and h2
    out_o1 = self.o1.feedforward(np.array([out_h1, out_h2]))

    return out_o1

network = OurNeuralNetwork()
x = np.array([2, 3])
print(network.feedforward(x))  # 0.7216325609518421
We got 0.7216 again! Looks like it works.
3. Training a Neural Network, Part 1
Say we have the following measurements:
Name    | Weight (lb) | Height (in) | Gender
Alice   | 133         | 65          | F
Bob     | 160         | 72          | M
Charlie | 152         | 70          | M
Diana   | 120         | 60          | F
Let’s train our network to predict someone’s gender given their weight and height:
We’ll represent Male with a 0 and Female with a 1, and we’ll also shift the data to make it easier to use:
Name    | Weight (minus 135) | Height (minus 66) | Gender
Alice   | -2                 | -1                | 1
Bob     | 25                 | 6                 | 0
Charlie | 17                 | 4                 | 0
Diana   | -15                | -6                | 1
I arbitrarily chose the shift amounts (135 and 66) to make the numbers look nice. Normally, you’d shift by the mean.
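For reference, here's a minimal sketch (not part of the original code) of what shifting by the mean would look like with NumPy:

import numpy as np

# Raw measurements: [weight (lb), height (in)] for Alice, Bob, Charlie, Diana
raw = np.array([
  [133, 65],
  [160, 72],
  [152, 70],
  [120, 60],
])

# Subtract each column's mean so the features are centered around 0
shifted = raw - raw.mean(axis=0)
print(raw.mean(axis=0))  # [141.25  66.75]
print(shifted)           # e.g. Alice becomes [-8.25, -1.75]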
Loss
Before we train our network, we first need a way to quantify how “good” it’s doing so that it can try to do “better”. That’s what the loss is.
We’ll use the mean squared error (MSE) loss:
MSE = (1/n) * Σ (y_true - y_pred)^2

Here, n is the number of samples, y_true is the true value of the variable (the "correct answer"), y_pred is the value the network predicts, and the sum runs over all n samples. Better predictions mean lower loss, so training a network really means trying to minimize its loss.
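As a quick worked example, suppose the network predicted 0 ("Male") for all four people in our dataset. Here's that calculation in code (using the same mse_loss helper that appears in the full program later):

import numpy as np

def mse_loss(y_true, y_pred):
  # y_true and y_pred are numpy arrays of the same length.
  return ((y_true - y_pred) ** 2).mean()

y_true = np.array([1, 0, 0, 1])  # Alice, Bob, Charlie, Diana
y_pred = np.array([0, 0, 0, 0])  # the network predicted 0 ("Male") for everyone

print(mse_loss(y_true, y_pred))  # 0.5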
4. Training a Neural Network, Part 2

We now have a clear goal: minimize the loss of the network by adjusting its weights and biases. For simplicity, let's pretend our dataset only contains Alice. Then the MSE loss is just Alice's squared error (her y_true is 1):

L = (1 - y_pred)^2

Let's also label every weight and bias in the network: h1 uses weights w1, w2 and bias b1; h2 uses weights w3, w4 and bias b2; and o1 uses weights w5, w6 and bias b3. Loss is then a function of all of these parameters. If we tweak w1, how does L change? That question is answered by the partial derivative ∂L/∂w1, which we can split up using the chain rule:

∂L/∂w1 = ∂L/∂y_pred * ∂y_pred/∂w1

We can calculate ∂L/∂y_pred because we computed L = (1 - y_pred)^2 above:

∂L/∂y_pred = ∂(1 - y_pred)^2 / ∂y_pred = -2(1 - y_pred)
Now, let’s figure out what to do with ∂y_pred/∂w1. Just like before, let h1, h2, o1 be the outputs of the neurons they represent. Then

y_pred = o1 = f(w5 * h1 + w6 * h2 + b3)

Since w1 only affects h1 (not h2), we can write

∂y_pred/∂w1 = ∂y_pred/∂h1 * ∂h1/∂w1
∂y_pred/∂h1 = w5 * f'(w5 * h1 + w6 * h2 + b3)

We do the same thing for ∂h1/∂w1:

h1 = f(w1 * x1 + w2 * x2 + b1)
∂h1/∂w1 = x1 * f'(w1 * x1 + w2 * x2 + b1)
This is the second time we’ve seen f'(x) (the derivative of the sigmoid function) now! Let’s derive it:
f(x) = 1 / (1 + e^(-x))
f'(x) = e^(-x) / (1 + e^(-x))^2 = f(x) * (1 - f(x))

We’ll use this nice form for f'(x) later.
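If you want to convince yourself of that identity numerically, here's a quick check (an extra example, not from the original post) comparing f(x) * (1 - f(x)) against a finite-difference estimate of the slope:

import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

x = -3.0
eps = 1e-6

analytic = sigmoid(x) * (1 - sigmoid(x))                    # the nice closed form
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps) # numerical slope

print(analytic)  # ~0.045177
print(numeric)   # ~0.045177 (matches to many decimal places)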
We’re done! We’ve managed to break down ∂L/∂w1 into several parts we can calculate:

∂L/∂w1 = ∂L/∂y_pred * ∂y_pred/∂h1 * ∂h1/∂w1
This system of calculating partial derivatives by working backwards is known as backpropagation, or “backprop”.
Phew. That was a lot of symbols - it’s alright if you’re still a bit confused. Let’s do an example to see this in action!
Example: Calculating the Partial Derivative
We’re going to continue pretending only Alice is in our dataset:
Name  | Weight (minus 135) | Height (minus 66) | Gender
Alice | -2                 | -1                | 1

Let’s initialize all the weights to 1 and all the biases to 0. If we do a feedforward pass through the network, we get:

h1 = f(w1 * x1 + w2 * x2 + b1) = f(-2 + -1 + 0) = f(-3) = 0.0474
h2 = f(w3 * x1 + w4 * x2 + b2) = f(-3) = 0.0474
o1 = f(w5 * h1 + w6 * h2 + b3) = f(0.0474 + 0.0474 + 0) = f(0.0948) = 0.524

The network outputs y_pred = 0.524, which doesn’t strongly favor Male (0) or Female (1). Let’s calculate ∂L/∂w1:
∂L/∂w1 = ∂L/∂y_pred * ∂y_pred/∂h1 * ∂h1/∂w1

∂L/∂y_pred = -2(1 - y_pred) = -2(1 - 0.524) = -0.952

∂y_pred/∂h1 = w5 * f'(w5 * h1 + w6 * h2 + b3) = 1 * f'(0.0474 + 0.0474 + 0) = f(0.0948) * (1 - f(0.0948)) = 0.249

∂h1/∂w1 = x1 * f'(w1 * x1 + w2 * x2 + b1) = -2 * f'(-2 + -1 + 0) = -2 * f(-3) * (1 - f(-3)) = -0.0904

Putting it all together:

∂L/∂w1 = -0.952 * 0.249 * -0.0904 = 0.0214
Reminder: we derived f'(x) = f(x) * (1 - f(x)) for our sigmoid activation function earlier.

We did it! This tells us that if we were to increase w1, L would increase a tiiiny bit as a result.
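As an optional sanity check (an extra example, not from the original post), you can approximate ∂L/∂w1 with a finite difference and see that it comes out to roughly 0.0214:

import numpy as np

def sigmoid(x):
  return 1 / (1 + np.exp(-x))

def loss_given_w1(w1):
  # All other weights are 1, all biases are 0; input is Alice: x = [-2, -1], y_true = 1
  x1, x2, y_true = -2, -1, 1
  h1 = sigmoid(w1 * x1 + 1 * x2 + 0)
  h2 = sigmoid(1 * x1 + 1 * x2 + 0)
  o1 = sigmoid(1 * h1 + 1 * h2 + 0)
  return (y_true - o1) ** 2

eps = 1e-6
print((loss_given_w1(1 + eps) - loss_given_w1(1 - eps)) / (2 * eps))  # ~0.0214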
Training: Stochastic Gradient Descent
We have all the tools we need to train a neural network now! We’ll use an optimization algorithm called stochastic gradient descent (SGD) that tells us how to change our weights and biases to minimize loss. It’s basically just this update equation:
w1 ← w1 - η * ∂L/∂w1
η is a constant called the learning rate that controls how fast we train. All we’re doing is subtracting η * ∂L/∂w1 from w1:

- If ∂L/∂w1 is positive, w1 will decrease, which makes L decrease.
- If ∂L/∂w1 is negative, w1 will increase, which makes L decrease.
If we do this for every weight and bias in the network, the loss will slowly decrease and our network will improve.
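To make the update rule concrete, here's a tiny sketch (using the numbers from the example above, not code from the original post) of a single SGD step for w1:

learn_rate = 0.1    # this is the learning rate eta
w1 = 1.0            # current value of w1
d_L_d_w1 = 0.0214   # the partial derivative we just calculated

w1 = w1 - learn_rate * d_L_d_w1
print(w1)  # 0.99786 - w1 decreased slightly, which should make L decrease slightly too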
Our training process will look like this:

1. Choose one sample from our dataset. This is what makes it stochastic gradient descent - we only operate on one sample at a time.
2. Calculate all the partial derivatives of loss with respect to weights or biases (e.g. ∂L/∂w1, ∂L/∂w2, etc.).
3. Use the update equation to update each weight and bias.
4. Go back to step 1.
Code: A Complete Neural Network
It’s finally time to implement a complete neural network:
Name    | Weight (minus 135) | Height (minus 66) | Gender
Alice   | -2                 | -1                | 1
Bob     | 25                 | 6                 | 0
Charlie | 17                 | 4                 | 0
Diana   | -15                | -6                | 1
import numpy as np

def sigmoid(x):
  # Sigmoid activation function: f(x) = 1 / (1 + e^(-x))
  return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
  # Derivative of sigmoid: f'(x) = f(x) * (1 - f(x))
  fx = sigmoid(x)
  return fx * (1 - fx)

def mse_loss(y_true, y_pred):
  # y_true and y_pred are numpy arrays of the same length.
  return ((y_true - y_pred) ** 2).mean()

class OurNeuralNetwork:
  '''
  A neural network with:
    - 2 inputs
    - a hidden layer with 2 neurons (h1, h2)
    - an output layer with 1 neuron (o1)

  *** DISCLAIMER ***:
  The code below is intended to be simple and educational, NOT optimal.
  Real neural net code looks nothing like this. DO NOT use this code.
  Instead, read/run it to understand how this specific network works.
  '''
  def __init__(self):
    # Weights
    self.w1 = np.random.normal()
    self.w2 = np.random.normal()
    self.w3 = np.random.normal()
    self.w4 = np.random.normal()
    self.w5 = np.random.normal()
    self.w6 = np.random.normal()

    # Biases
    self.b1 = np.random.normal()
    self.b2 = np.random.normal()
    self.b3 = np.random.normal()

  def feedforward(self, x):
    # x is a numpy array with 2 elements.
    h1 = sigmoid(self.w1 * x[0] + self.w2 * x[1] + self.b1)
    h2 = sigmoid(self.w3 * x[0] + self.w4 * x[1] + self.b2)
    o1 = sigmoid(self.w5 * h1 + self.w6 * h2 + self.b3)
    return o1

  def train(self, data, all_y_trues):
    '''
    - data is a (n x 2) numpy array, n = # of samples in the dataset.
    - all_y_trues is a numpy array with n elements.
      Elements in all_y_trues correspond to those in data.
    '''
    learn_rate = 0.1
    epochs = 1000  # number of times to loop through the entire dataset

    for epoch in range(epochs):
      for x, y_true in zip(data, all_y_trues):
        # --- Do a feedforward (we'll need these values later)
        sum_h1 = self.w1 * x[0] + self.w2 * x[1] + self.b1
        h1 = sigmoid(sum_h1)

        sum_h2 = self.w3 * x[0] + self.w4 * x[1] + self.b2
        h2 = sigmoid(sum_h2)

        sum_o1 = self.w5 * h1 + self.w6 * h2 + self.b3
        o1 = sigmoid(sum_o1)
        y_pred = o1

        # --- Calculate partial derivatives.
        # --- Naming: d_L_d_w1 represents "partial L / partial w1"
        d_L_d_ypred = -2 * (y_true - y_pred)

        # Neuron o1
        d_ypred_d_w5 = h1 * deriv_sigmoid(sum_o1)
        d_ypred_d_w6 = h2 * deriv_sigmoid(sum_o1)
        d_ypred_d_b3 = deriv_sigmoid(sum_o1)

        d_ypred_d_h1 = self.w5 * deriv_sigmoid(sum_o1)
        d_ypred_d_h2 = self.w6 * deriv_sigmoid(sum_o1)

        # Neuron h1
        d_h1_d_w1 = x[0] * deriv_sigmoid(sum_h1)
        d_h1_d_w2 = x[1] * deriv_sigmoid(sum_h1)
        d_h1_d_b1 = deriv_sigmoid(sum_h1)

        # Neuron h2
        d_h2_d_w3 = x[0] * deriv_sigmoid(sum_h2)
        d_h2_d_w4 = x[1] * deriv_sigmoid(sum_h2)
        d_h2_d_b2 = deriv_sigmoid(sum_h2)

        # --- Update weights and biases
        # Neuron h1
        self.w1 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1
        self.w2 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w2
        self.b1 -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * d_h1_d_b1

        # Neuron h2
        self.w3 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w3
        self.w4 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_w4
        self.b2 -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * d_h2_d_b2

        # Neuron o1
        self.w5 -= learn_rate * d_L_d_ypred * d_ypred_d_w5
        self.w6 -= learn_rate * d_L_d_ypred * d_ypred_d_w6
        self.b3 -= learn_rate * d_L_d_ypred * d_ypred_d_b3

      # --- Calculate total loss at the end of each epoch
      if epoch % 10 == 0:
        y_preds = np.apply_along_axis(self.feedforward, 1, data)
        loss = mse_loss(all_y_trues, y_preds)
        print("Epoch %d loss: %.3f" % (epoch, loss))

# Define dataset
data = np.array([
  [-2, -1],   # Alice
  [25, 6],    # Bob
  [17, 4],    # Charlie
  [-15, -6],  # Diana
])
all_y_trues = np.array([
  1,  # Alice
  0,  # Bob
  0,  # Charlie
  1,  # Diana
])

# Train our neural network!
network = OurNeuralNetwork()
network.train(data, all_y_trues)
Our loss steadily decreases as the network learns:
We can now use the network to predict genders:
# Make some predictions
emily = np.array([-7, -3]) # 128 pounds, 63 inches
frank = np.array([20, 2]) # 155 pounds, 68 inches
print("Emily: %.3f" % network.feedforward(emily)) # 0.951 - F
print("Frank: %.3f" % network.feedforward(frank)) # 0.039 - M
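The network outputs a number between 0 and 1, so one simple way to turn a prediction into a label (a convention I'm adding here; the original stops at the raw numbers) is to threshold it at 0.5:

def to_gender(pred):
  # Outputs closer to 1 mean Female (1), closer to 0 mean Male (0)
  return "F" if pred >= 0.5 else "M"

print(to_gender(0.951))  # F (Emily)
print(to_gender(0.039))  # M (Frank)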