Deep Learning in Action: Building and Training a Neural Network for MNIST Classification and Exploring Backpropagation Through Gradient Descent


In the dynamic and ever-evolving field of artificial intelligence, deep learning stands out as a revolutionary force, driving progress across numerous applications—from voice recognition systems to self-driving cars. At the heart of deep learning lies the neural network, a computational architecture inspired by the biological networks within our own brains. This article dives into the practical implementation of neural networks, demonstrating their power and versatility in image recognition tasks using the iconic MNIST dataset.

The MNIST dataset, a collection of handwritten digits, has been the benchmark for image classification algorithms, providing a playground for beginners and experts alike to test the limits of algorithmic accuracy. Our journey will take us through the construction of a deep learning model tailored to classify these digits with precision, incorporating techniques like batch normalization and leveraging the robust features of TensorFlow, a leading framework in the field.

Furthermore, we unravel the complexities of neural networks by manually computing the forward and backward propagation steps, a foundational concept that enables these models to learn from data. This exercise not only solidifies the understanding of how neural networks adjust their parameters to minimize error but also illustrates the mathematical intricacies underpinning these powerful tools.

Whether you are a seasoned data scientist or a curious enthusiast, this article aims to provide a clear and thorough walkthrough of creating a deep learning model for the MNIST dataset and a deeper understanding of the mechanics of neural networks. Join us as we embark on this computational adventure, where each line of code brings us closer to the frontier of artificial intelligence.


Note 1: This article is part of the following article:

Note 2: We will be using TensorFlow. For a quick start with TensorFlow, you may read the following article:

What is the MNIST dataset?

The MNIST dataset (Modified National Institute of Standards and Technology dataset) is a large database of handwritten digits that is commonly used for training various image processing systems. It's one of the most widely used datasets for benchmarking machine learning algorithms, especially in the field of computer vision.

Here are some key points about the MNIST dataset:

  • Content: It contains 70,000 images of handwritten digits (0 through 9). Each image is a 28x28 pixel grayscale image, which is usually flattened into a 784-dimensional vector for processing.
  • Split: The dataset is typically split into a training set of 60,000 examples and a test set of 10,000 examples. This standard split is used to train a model and then test its performance on unseen data to evaluate its generalization capability.
  • Usage: MNIST is often used as a "Hello, World!" dataset in the field of machine learning for image recognition. It's a go-to dataset because the problem it poses is complex enough to require machine learning to solve but also simple enough to be tackled with basic neural networks.
  • Challenges: While modern deep learning models can achieve very high accuracy on the MNIST dataset, it continues to be used in academic settings for educational purposes and to test new types of neural network architectures or training approaches.
  • Historical Importance: MNIST was one of the first datasets that showed the effectiveness of neural networks in computer vision tasks in the 1990s and early 2000s, and it helped to reignite interest in the neural network field.

Even simple neural networks usually reach high accuracy on MNIST, which shows how well they can learn the patterns of handwritten digits.

Before we start the coding part, I suggest you go through the following video:


1 - Exploratory Data Analysis

Let's first load the dataset and see what is inside.

1.1 Install TensorFlow

!pip install tensorflow

1.2 Load MNIST Dataset

import tensorflow as tf

mnist = tf.keras.datasets.mnist

1.3 Explore the content and structure of the MNIST dataset

You can explore the content and structure of the MNIST dataset by examining the arrays that are loaded into memory. When you load MNIST using tf.keras.datasets.mnist, it returns two tuples: one for the training data and one for the test data. Each tuple contains images and their corresponding labels.

Here's how you can inspect the content and structure:

import tensorflow as tf

# Load the MNIST dataset
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Check the shape of the arrays
print("Training images shape:", train_images.shape)  # Should be (60000, 28, 28)
print("Training labels shape:", train_labels.shape)  # Should be (60000,)
print("Test images shape:", test_images.shape)       # Should be (10000, 28, 28)
print("Test labels shape:", test_labels.shape)       # Should be (10000,)

# Check the range of pixel values
print("Training images pixel values range from", train_images.min(), "to", train_images.max())
print("Test images pixel values range from", test_images.min(), "to", test_images.max())

# Check the first few labels
print("First 10 training labels:", train_labels[:10])

# Visualize the first image in the training dataset
import matplotlib.pyplot as plt

plt.imshow(train_images[0], cmap='gray')
plt.title(f"Label: {train_labels[0]}")
plt.show()  # needed when running as a script; notebooks render the figure automatically


The previous code does the following:

  • Loads the MNIST dataset.
  • Prints the shapes of the training and testing arrays, which should show 60,000 training images and 10,000 test images, each of size 28x28 pixels.
  • Prints the range of pixel values, which is 0 to 255 for these grayscale images. Output: Training images pixel values range from 0 to 255. Test images pixel values range from 0 to 255.
  • Prints the first 10 labels from the training dataset, which are integers from 0 to 9 representing the digits in the corresponding images. Output: First 10 training labels: [5 0 4 1 9 2 1 3 1 4]
  • Uses matplotlib to visualize the first image in the training dataset, along with its label. This image shows the digit 5.


This will give you a good sense of the structure and content of the MNIST dataset.
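
Another quick check during exploration is how often each digit occurs. The sketch below is a minimal illustration, assuming NumPy is available. It uses the first 10 training labels printed above as a stand-in for the full train_labels array, and the names labels and counts are illustrative.

```python
import numpy as np

# Stand-in for train_labels: the first 10 training labels printed above.
labels = np.array([5, 0, 4, 1, 9, 2, 1, 3, 1, 4])

# Count how many times each digit 0-9 appears.
counts = np.bincount(labels, minlength=10)

for digit, count in enumerate(counts):
    print(f"digit {digit}: {count} sample(s)")
```

On the real dataset, passing train_labels instead of the stand-in array gives the count for each of the 10 classes.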

For fun, here is the code to plot the first 10 images in the training set:

import tensorflow as tf
import matplotlib.pyplot as plt

# Load the MNIST dataset (only the training split is needed here)
mnist = tf.keras.datasets.mnist
(train_images, train_labels), (_, _) = mnist.load_data()

# Plot the first 10 images in the training set, side by side
plt.figure(figsize=(20, 4))
for i in range(10):
    plt.subplot(1, 10, i+1)
    plt.imshow(train_images[i], cmap='gray')
    plt.title(f"Label: {train_labels[i]}")
    plt.axis('off')  # hide the axis ticks around each digit
plt.show()

An MNIST digit image is a 28x28 pixel grayscale image of a single handwritten digit (0-9). Each pixel in the image has a value between 0 (black) and 255 (white), representing the intensity of the grayscale color at that pixel.

Here's an example of an MNIST digit image of the number "2":

Figure: MNIST digit 2

As you can see, the image is indeed 28x28 pixels and contains only the shape of the digit "2" with all other pixels being black (0).

2 - Data Preparation

2.1 - Normalize the Data

Let's normalize the pixel values to the range 0 to 1.

# Normalize the images
mnist_train_images = mnist_train_images / 255.0
mnist_test_images = mnist_test_images / 255.0

  • Objective: The purpose of this code is to normalize the pixel values of images from the MNIST dataset. Normalization is a preprocessing step which involves transforming the values of pixels so that they fall within a certain range, often between 0 and 1. This can help improve the convergence speed of the training process and the overall performance of the model.
  • Normalization Process:

mnist_train_images = mnist_train_images / 255.0

This line of code takes the training images from the MNIST dataset and divides each pixel value by 255. Pixel values in an image are typically in the range of 0 to 255, representing the intensity of a pixel in grayscale (0 being black, 255 being white, and values in between representing shades of gray).

  • Dividing by 255 scales these values down to a range between 0 and 1. This is done for all the training images.

mnist_test_images = mnist_test_images / 255.0

Similarly, this line scales the pixel values of the test images from the test dataset to a range between 0 and 1 by dividing each pixel value by 255.
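
The effect of the division can be checked on a small array. This is a minimal sketch, assuming NumPy. The 2x2 "image" is a hypothetical example that uses the same integer type as raw MNIST pixels.

```python
import numpy as np

# A tiny hypothetical 2x2 "image" with the same dtype as raw MNIST pixels.
image = np.array([[0, 51], [102, 255]], dtype=np.uint8)

# Dividing by a float converts the array to floating point and rescales it.
normalized = image / 255.0

print(image.dtype, image.min(), image.max())
print(normalized.dtype, normalized.min(), normalized.max())
print(normalized)
```

The integer pixels become floating-point values, with 0 staying at 0 and 255 mapping to 1.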

Importance of Normalization

  • Consistency: Ensures that all input features (pixel values, in this case) are on a consistent scale, which is important for models that are sensitive to the magnitude of values, such as neural networks.
  • Convergence: Helps in faster convergence of gradient descent during the training process, as it ensures that all the inputs have a uniform scale.
  • Performance: Can lead to better performance and lower generalization error of the model, as it prevents the model from prioritizing the input features incorrectly based on their scale.


This normalization technique is particularly common in the preprocessing steps for deep learning models dealing with images, such as Convolutional Neural Networks (CNNs) used for image classification tasks.

Here is what the digit "1" looks like after normalization: a matrix whose values all lie between 0 and 1.

The full training set is a 3D array of shape (samples, height, width). Training images shape: (60000, 28, 28)

2.2 - Flatten the images to 1D Vector

# Flatten the images to one-dimensional vector
train_images = train_images.reshape((train_images.shape[0], 28 * 28))
test_images = test_images.reshape((test_images.shape[0], 28 * 28))

There are a few important reasons why we flatten images in the MNIST example to one-dimensional vectors before feeding them into the deep learning model:

  • Compatibility with Dense Layers: Dense (or fully-connected) layers are the fundamental building blocks of many deep neural networks. These layers expect their input to be in the form of one-dimensional vectors. Each neuron in a dense layer takes an input, multiplies it by a weight, adds a bias, and applies an activation function. Flattening the image allows each pixel to be treated as an individual feature input for these neurons.
  • Computational Efficiency: While convolutional neural networks (CNNs) are better suited for maintaining spatial relationships in image data, a simple dense-layer based model can still work reasonably well on the MNIST dataset. Flattening does not change the number of values per image (28 x 28 = 784); it only rearranges them into a vector, so a small dense model stays cheap to train.
  • Image Simplification: In the case of MNIST, the digits are relatively simple and centered within the images. Flattening discards the explicit 2D neighborhood structure, but a dense model can still classify these digits accurately without it.
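
To make the reshape concrete, here is a small sketch, assuming NumPy. It uses a hypothetical batch of 3 images filled with distinct numbers, so that each pixel can be followed from its 2D position to its place in the flattened vector.

```python
import numpy as np

# Hypothetical batch of 3 images, 28x28 each, filled with distinct values
# so that positions can be tracked through the reshape.
images = np.arange(3 * 28 * 28).reshape(3, 28, 28)

flat = images.reshape(images.shape[0], 28 * 28)
print(images.shape, "->", flat.shape)

# Row-major order: pixel (row, col) lands at index row * 28 + col.
row, col = 5, 7
assert flat[1, row * 28 + col] == images[1, row, col]
```

No pixel value is changed or dropped; only the layout of the array differs.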

Alternatively, you can let a Flatten layer do this automatically:

from tensorflow.keras.layers import Flatten, Dense

# Build the model
model = tf.keras.Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax')  # 10 classes for the digits 0-9
])

Preprocessing with Flatten: The common practice is to use a Flatten layer as the first layer (after specifying input_shape) in a Sequential model when dealing with image data. This layer converts the 2D image data into a 1D array, making it compatible with the Dense layer's expectations. It effectively reshapes the input images from a 2D format to a format (1D vector) that the Dense layer can work with.

3 - Build the Model

We will build the following simple neural network (we will solve the same problem with a CNN later):

# Build the model
model = tf.keras.Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dense(10, activation='softmax')  # 10 classes for the digits 0-9
])

The previous code constructs a neural network model using TensorFlow's Keras API. This model is designed for a classification task for the MNIST dataset, which consists of 28x28 pixel grayscale images of handwritten digits (0-9). Here's a step-by-step explanation of the model:

Code Breakdown

  • tf.keras.Sequential: This initializes a sequential model, which is a linear stack of layers. Sequential models allow you to create models layer-by-layer in a step-by-step fashion.
  • Flatten(input_shape=(28, 28)):
      • The first layer is a Flatten layer, which transforms each input image from a 2D array (28 by 28 pixels) into a 1D array (784 pixels).
      • input_shape=(28, 28) specifies the shape of the input data: each image is 28 pixels high and 28 pixels wide. No channel dimension is needed here, because the grayscale images are loaded as 2D arrays and flattened immediately.
  • Dense(128, activation='relu'):
      • The next layer is a Dense (fully connected) layer with 128 neurons, or units.
      • The activation='relu' parameter applies the Rectified Linear Unit (ReLU) activation function to the output of each neuron. ReLU introduces non-linearity into the model, allowing it to learn more complex patterns.
  • BatchNormalization():
      • This layer applies batch normalization, a technique that normalizes the inputs of the following layer. It helps to accelerate training, improve model stability, and reduce the sensitivity to network initialization.
      • It normalizes the output of the previous layer over each mini-batch (to roughly zero mean and unit variance, followed by a learned scale and shift), which keeps the distribution of activations passed to the next layer stable during training.
  • Dense(10, activation='softmax'):
      • The final layer is another Dense layer with 10 neurons, corresponding to the 10 classes of digits (0-9) that the model is trying to classify.
      • The activation='softmax' parameter applies the softmax activation function to the output. Softmax converts the outputs to probabilities that sum to 1, so the model outputs a probability distribution over the 10 classes. This is particularly useful for multi-class classification tasks.
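
The computation performed by the Flatten and Dense layers described above can be sketched in plain NumPy. This is an illustration under stated assumptions, not the Keras implementation: the weights are random rather than trained, the batch of 4 images is hypothetical, and the batch normalization layer is left out for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch of 4 normalized images (random values in [0, 1)).
x = rng.random((4, 28, 28))

# Randomly initialized parameters with the same shapes as the Keras layers.
W1, b1 = rng.normal(0, 0.05, (784, 128)), np.zeros(128)
W2, b2 = rng.normal(0, 0.05, (128, 10)), np.zeros(10)

flat = x.reshape(len(x), -1)                 # Flatten: (4, 784)
hidden = np.maximum(0, flat @ W1 + b1)       # Dense(128) followed by ReLU
logits = hidden @ W2 + b2                    # Dense(10) before softmax
exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
probs = exp / exp.sum(axis=1, keepdims=True)

print(probs.shape)         # (4, 10)
print(probs.sum(axis=1))   # each row sums to 1 (up to rounding)

# Number of trainable parameters in the two Dense layers.
n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)            # 101770
```

Each row of probs is the model's probability distribution over the 10 digits for one image; training adjusts W1, b1, W2, and b2 so that the correct digit receives the highest probability.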

Model Overview

This model architecture is relatively simple and consists of an input layer (the Flatten layer), one hidden layer (the first Dense layer), a batch normalization layer to improve training efficiency and stability, and an output layer (the second Dense layer). It's designed to classify 28x28 pixel images into one of 10 classes (digits 0-9). The use of relu activation functions in hidden layers helps to mitigate the vanishing gradient problem, and the softmax activation in the output layer makes it suitable for multi-class classification. Batch normalization is included to enhance the training dynamics.
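
What batch normalization does to a batch of activations during training can be sketched as follows. This is a simplified NumPy illustration with hypothetical activations. It covers only the training-time computation; at inference time Keras uses moving averages of the mean and variance instead. The value eps = 1e-3 matches the Keras default, and gamma and beta are shown at their initial values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activations from a hidden layer: batch of 32, 128 units,
# deliberately shifted and scaled away from zero mean / unit variance.
activations = rng.normal(loc=3.0, scale=2.0, size=(32, 128))

gamma = np.ones(128)   # learned scale (initial value)
beta = np.zeros(128)   # learned shift (initial value)
eps = 1e-3             # small constant for numerical stability

# Statistics are computed per unit, across the batch dimension.
mean = activations.mean(axis=0)
var = activations.var(axis=0)
normalized = gamma * (activations - mean) / np.sqrt(var + eps) + beta

print(normalized.mean(axis=0)[:3])  # close to 0 for each unit
print(normalized.std(axis=0)[:3])   # close to 1 for each unit
```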

Here is the full program:

import tensorflow as tf
from tensorflow.keras.layers import Flatten, Dense, BatchNormalization
from tensorflow.keras.callbacks import ModelCheckpoint

# Load the dataset
(mnist_train_images, mnist_train_labels), (mnist_test_images, mnist_test_labels) = tf.keras.datasets.mnist.load_data()

# Normalize the images
mnist_train_images = mnist_train_images / 255.0
mnist_test_images = mnist_test_images / 255.0

# Build the model
model = tf.keras.Sequential([
    Flatten(input_shape=(28, 28)),
    Dense(128, activation='relu'),
    BatchNormalization(),
    Dense(10, activation='softmax')  # 10 classes for the digits 0-9
])

# Compile the model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

# Set up model checkpoints (keep only the weights with the best validation accuracy)
# Note: recent Keras versions recommend the native '.keras' format over '.h5'
checkpoint_path = 'mnist_model_checkpoint.h5'
checkpoint = ModelCheckpoint(checkpoint_path, save_best_only=True, monitor='val_accuracy', mode='max')

# Fit the model, holding out 20% of the training data for validation
model.fit(mnist_train_images, mnist_train_labels, epochs=10, validation_split=0.2, callbacks=[checkpoint])

# Evaluate the model
test_loss, test_acc = model.evaluate(mnist_test_images, mnist_test_labels)

print(f'Test accuracy: {test_acc}')
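
The loss named in the compile step, sparse_categorical_crossentropy, takes integer labels (such as 5) rather than one-hot vectors and penalizes a low predicted probability for the true class. The following is a minimal NumPy sketch of that calculation, not the Keras implementation; the probabilities and labels are hypothetical.

```python
import numpy as np

# Hypothetical softmax outputs for 2 samples over 3 classes.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])  # integer class labels, not one-hot vectors

# Pick the predicted probability of the true class for each sample.
true_class_probs = probs[np.arange(len(labels)), labels]

# Cross-entropy is the mean negative log of those probabilities.
loss = -np.log(true_class_probs).mean()
print(loss)
```

The closer the predicted probability of the correct class is to 1, the closer its contribution to the loss is to 0.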



Here's a step-by-step guide again:

1. Load the MNIST dataset

from tensorflow.keras.datasets import mnist

(train_images, train_labels), (test_images, test_labels) = mnist.load_data()

# Normalize the pixel values to be between 0 and 1
train_images = train_images / 255.0
test_images = test_images / 255.0

# Flatten the images to one-dimensional vectors
train_images = train_images.reshape((train_images.shape[0], 28 * 28))
test_images = test_images.reshape((test_images.shape[0], 28 * 28))

2. Define the deep learning model

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization

model = Sequential([
    Dense(512, activation='relu', input_shape=(784,)),
    BatchNormalization(),  # batch normalization after each hidden layer (placement is an assumption)
    Dense(256, activation='relu'),
    BatchNormalization(),
    Dense(10, activation='softmax')  # 10 units for 10 classes
])


3. Train the model with batch normalization and model checkpoints

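from tensorflow.keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint('model.h5', save_best_only=True)
# Compile settings are assumed to match the full program above (Adam, sparse categorical cross-entropy)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_images, train_labels, epochs=5, batch_size=32, validation_data=(test_images, test_labels), callbacks=[checkpoint])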

4. Fit a regression line using TensorFlow's gradient tape

4.1. Prepare the regression dataset

We assume the regression dataset is already prepared as NumPy arrays x_data and y_data, so let's proceed with the following steps:
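
If you do not have such arrays at hand, a synthetic stand-in can be generated. The sketch below is one hypothetical way to do it, assuming NumPy; the slope 3, intercept 2, noise level, and sample count are illustrative choices, not values from the original dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 100 samples, 1 feature, following y = 3x + 2 plus noise.
x_data = rng.uniform(-1.0, 1.0, size=(100, 1)).astype(np.float32)
noise = rng.normal(0.0, 0.1, size=(100, 1)).astype(np.float32)
y_data = 3.0 * x_data + 2.0 + noise

print(x_data.shape, y_data.shape)
```

Both arrays have shape (100, 1), that is, one row per sample and one column per feature or target.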

4.2. Define the model and loss function

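import tensorflow as tf

# Trainable parameters, created once so the training loop can update them
# (assumes x_data has shape (n_samples, n_features) and y_data has shape (n_samples, 1))
weights = tf.Variable(tf.random.normal([x_data.shape[1], 1]))
bias = tf.Variable(0.0)

# Define the model as a simple linear regression function
def model(x):
  return tf.matmul(tf.cast(x, tf.float32), weights) + bias

# Define the loss function as mean squared error
def loss(y_true, y_pred):
  return tf.reduce_mean((tf.cast(y_true, tf.float32) - y_pred) ** 2)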

4.3. Implement gradient tape and training loop

# Create an optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)

# Training loop
for epoch in range(100):
  with tf.GradientTape() as tape:
    # Get predictions
    y_pred = model(x_data)
    # Calculate loss
    loss_value = loss(y_data, y_pred)

  # Calculate gradients
  grads = tape.gradient(loss_value, [weights, bias])

  # Update weights
  optimizer.apply_gradients(zip([weights, bias], grads))

  # Print loss
  print(f"Epoch {epoch+1}, Loss: {loss_value.numpy()}")

This code defines a simple linear regression model, calculates the mean squared error loss, and updates the weights using the Adam optimizer within a training loop.
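
To see what the gradient tape and the optimizer do under the hood, the forward and backward passes can be written out by hand. For the mean squared error L = mean((Xw + b - y)^2), the gradients are dL/dw = (2/n) X^T (Xw + b - y) and dL/db = (2/n) sum(Xw + b - y). The following is a minimal NumPy sketch under stated assumptions: the data is a hypothetical noiseless line y = 3x + 2, and plain gradient descent is used in place of Adam to keep the update rule visible.

```python
import numpy as np

# Hypothetical noiseless data on the line y = 3x + 2.
x = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
y = 3.0 * x + 2.0

w = np.zeros((1, 1))
b = 0.0
learning_rate = 0.1

for step in range(500):
    y_pred = x @ w + b              # forward pass
    error = y_pred - y
    loss = np.mean(error ** 2)      # mean squared error

    # Backward pass: derivatives of the loss with respect to w and b.
    grad_w = 2.0 * x.T @ error / len(x)
    grad_b = 2.0 * error.mean()

    # Plain gradient descent update (not Adam).
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(w.ravel(), b, loss)
```

The learned w and b approach 3 and 2. tape.gradient computes the same two derivatives automatically, which is what makes it practical for networks with many layers.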

By following these steps, you'll have created a deep learning model for classifying MNIST images with batch normalization and saved checkpoints, as well as fit a regression line using TensorFlow's gradient tape.


More articles by Rany ElHousieny, PhDᴬᴮᴰ
