BxD Primer Series: Gated Recurrent Unit (GRU) Neural Networks
Hey there 👋
Welcome to BxD Primer Series where we are covering topics such as Machine learning models, Neural Nets, GPT, Ensemble models, Hyper-automation in ‘one-post-one-topic’ format. Today’s post is on Gated Recurrent Unit (GRU) Neural Networks. Let’s get started:
The What:
GRU is a simplified variant of the Long Short-Term Memory (LSTM) network; both are types of recurrent neural networks (RNNs) widely used for sequence modeling tasks. Check our editions on RNN here and LSTM here.
Like LSTM, GRU is designed to address the vanishing gradient problem that can occur when training RNNs. Vanishing gradient is a problem where gradients become very small as they are propagated back through the network during training, which makes it difficult to learn long-term dependencies in data.
GRU addresses this problem by using a gating mechanism that allows it to selectively update and reset its hidden state. This gating mechanism is simpler than the one used in LSTM, which makes GRU easier to train and faster to compute.
The core idea of the GRU is to use two gates: an update gate and a reset gate. These gates are used to control the flow of information through the network, allowing it to selectively remember or forget previous information.
The update gate and reset gate are both sigmoid functions that take as input the concatenation of the previous hidden state and the current input. The outputs of these gates are then used to update the hidden state of the network.
Basic Architecture:
GRU has become a popular choice for sequence modeling tasks such as language modeling, machine translation, and speech recognition due to its simplicity, efficiency, and effectiveness.
Here is a ‘time unfolded’ view of the GRU network (original diagram not reproduced here): the same GRU cell is applied at every time step, taking the current input x_t and the previous hidden state h_{t-1}, and producing the new hidden state h_t, which is passed on to the next time step.
Difference between GRU and LSTM:
GRU (Gated Recurrent Unit) and LSTM (Long Short-Term Memory) are both popular types of recurrent neural networks (RNNs) that are designed to model sequential data. They are both capable of processing variable-length sequences of input data and maintaining information over long time intervals.
The primary difference between GRU and LSTM lies in their respective architectures. While both networks have gates that control the flow of information, the gating mechanisms in GRU are simpler and more streamlined than those in LSTM.
GRU uses only two gates: an update gate and a reset gate.
LSTM uses three gates: an input gate, an output gate, and a forget gate.
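One practical consequence of the simpler gating is a smaller parameter count. Below is a minimal sketch (assuming PyTorch is available; the input and hidden sizes are arbitrary choices for illustration) that compares the number of trainable parameters in a single-layer GRU and LSTM of the same dimensions:

```python
import torch.nn as nn

# Same input and hidden sizes for both layers (arbitrary values for illustration)
input_size, hidden_size = 64, 128

gru = nn.GRU(input_size, hidden_size, batch_first=True)
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

def num_params(module):
    # Total number of trainable parameters in the module
    return sum(p.numel() for p in module.parameters())

# GRU has 3 weight/bias blocks (reset, update, candidate) vs. 4 in LSTM,
# so it ends up with roughly 25% fewer parameters at the same sizes.
print("GRU parameters: ", num_params(gru))
print("LSTM parameters:", num_params(lstm))
```

Fewer parameters generally mean faster training and less risk of overfitting on smaller datasets, which is a big part of GRU's appeal.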
The How:
GRUs can be broken down into several equations that are computed at each time step t.
Let x_t be the input at time step t, and h_{t-1} be the output (hidden state) from the previous time step.
The reset gate r_t and update gate z_t are computed as follows:

r_t = σ(W_r · [h_{t-1}, x_t] + b_r)
z_t = σ(W_z · [h_{t-1}, x_t] + b_z)

Where,
σ is the sigmoid function, [h_{t-1}, x_t] denotes the concatenation of the previous hidden state and the current input, W_r and W_z are weight matrices, and b_r and b_z are bias vectors learned during training.
The next step is to compute the candidate activation vector ~h_t:

~h_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)
Where,
tanh is the hyperbolic tangent activation, ⊙ denotes element-wise multiplication, r_t ⊙ h_{t-1} is the previous hidden state scaled by the reset gate (so the network can ‘forget’ parts of it), and W_h and b_h are the learned weight matrix and bias vector.
The final output h_t is then computed as a weighted average of the previous output h_{t-1} and the candidate activation vector ~h_t, using the update gate z_t to determine the weights:

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ ~h_t

(Some formulations swap the roles of z_t and 1 - z_t; the idea is the same.)
This equation allows the GRU to selectively update and forget information over time, based on the current input and previous output.
Note 1: The candidate hidden state is a temporary memory that stores the information gathered from the current input and the previous hidden state.
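To make the equations concrete, here is a minimal single-step GRU cell in NumPy. This is only a sketch following the equations above; the weight shapes, the concatenation layout, and the random initialization are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU time step following the equations above.

    x_t:    input vector at time t, shape (input_size,)
    h_prev: previous hidden state, shape (hidden_size,)
    W_*:    weight matrices of shape (hidden_size, hidden_size + input_size)
    b_*:    bias vectors of shape (hidden_size,)
    """
    concat = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat + b_r)                 # reset gate
    z_t = sigmoid(W_z @ concat + b_z)                 # update gate
    concat_reset = np.concatenate([r_t * h_prev, x_t])
    h_cand = np.tanh(W_h @ concat_reset + b_h)        # candidate activation ~h_t
    h_t = (1 - z_t) * h_prev + z_t * h_cand           # weighted average via z_t
    return h_t

# Tiny usage example with random weights (illustrative only)
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
shape = (hidden_size, hidden_size + input_size)
W_r, W_z, W_h = (rng.standard_normal(shape) * 0.1 for _ in range(3))
b_r = b_z = b_h = np.zeros(hidden_size)
h = np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):        # a sequence of 5 time steps
    h = gru_step(x, h, W_r, W_z, W_h, b_r, b_z, b_h)
print(h)
```

In practice you would use a framework's built-in GRU layer rather than hand-rolled weights, but walking through one step like this makes the role of each gate easy to see.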
Difference between hidden state and output:
In a GRU, the hidden state and the output at each time step are the same vector: h_t is both emitted as the output for that step and passed forward as the memory for the next step. This is unlike LSTM, which maintains a separate cell state in addition to the hidden state it outputs.
Difference between unidirectional and bidirectional GRU:
A unidirectional GRU processes the input sequence in the forward direction only, whereas a bidirectional GRU processes it in both forward and backward directions.
In a unidirectional GRU, the output at each time step is based only on information from previous time steps. This makes it suitable for tasks where context flows in one direction, such as real-time or streaming prediction.
In contrast, a bidirectional GRU captures information from both past and future context. This makes it more suitable for tasks where the context flows in both directions, such as machine translation, where the meaning of a word in a sentence may depend on words that come both before and after it.
The output of a bidirectional GRU is typically computed by concatenating the hidden states of the forward and backward GRUs at each time step.
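As a quick illustration of that concatenation, the sketch below (assuming PyTorch; the batch, sequence, and feature sizes are arbitrary) shows that a bidirectional GRU's output feature dimension is twice the hidden size:

```python
import torch
import torch.nn as nn

hidden_size = 16
gru = nn.GRU(input_size=8, hidden_size=hidden_size,
             batch_first=True, bidirectional=True)

x = torch.randn(2, 10, 8)   # (batch, time steps, features)
output, h_n = gru(x)

# Forward and backward hidden states are concatenated at every time step
print(output.shape)         # torch.Size([2, 10, 32]) -> 2 * hidden_size
print(h_n.shape)            # torch.Size([2, 2, 16])  -> (num_directions, batch, hidden_size)
```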
Relationship between input sequence size and GRU layer size:
Input sequence size refers to the number of time steps in a sequence of input data fed to the GRU layer. GRU layer size refers to the number of GRU units in a single layer, i.e. the dimensionality of its hidden state.
If the input sequences are very long relative to the GRU layer size, the model may have difficulty learning long-term dependencies. The fixed-size hidden state has to compress information from many time steps, and if its capacity is too small it may not retain the relevant patterns.
On the other hand, if the input sequences are short relative to a large GRU layer size, the model can overfit, as there may not be enough information in the input to support the number of parameters in the model. This leads to poor generalization on unseen data.
It is also worth noting that stacking multiple GRU layers can compensate for a smaller layer size, as each additional layer can learn more complex patterns in the data.
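For example, here is a small sketch (assuming PyTorch; all sizes are arbitrary) of a stacked GRU, where two moderately sized layers are used instead of one very wide layer:

```python
import torch
import torch.nn as nn

# Two stacked GRU layers with a modest hidden size, instead of one very wide layer
gru = nn.GRU(input_size=32, hidden_size=64, num_layers=2,
             batch_first=True, dropout=0.2)

x = torch.randn(8, 100, 32)  # batch of 8 sequences, 100 time steps each
output, h_n = gru(x)
print(output.shape)          # torch.Size([8, 100, 64]) -> outputs of the top layer
print(h_n.shape)             # torch.Size([2, 8, 64])   -> final hidden state of each layer
```

The lower layer captures short-range structure while the upper layer can combine it into longer-range patterns, which is often a better use of parameters than simply widening a single layer.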
The Why:
Reasons for using GRUs:
They use a simpler gating mechanism than LSTMs, so they have fewer parameters, train faster, and are easier to tune.
They mitigate the vanishing gradient problem and can capture long-term dependencies in sequential data.
They perform well on sequence modeling tasks such as language modeling, machine translation, and speech recognition, especially when data or compute is limited.
The Why Not:
Reasons for not using GRUs:
With only two gates, a GRU can be somewhat less expressive than an LSTM on tasks that need very fine-grained control over what is remembered and forgotten.
Like all recurrent networks, GRUs process sequences step by step, which limits parallelization on very long sequences.
If your task does not involve sequential structure, a recurrent architecture adds complexity without benefit.
Time for you to support:
If you found this post useful, share it with your network and subscribe so you don't miss upcoming editions.
In the next edition, we will cover Autoencoder Neural Networks.
Let us know your feedback!
Until then,
Have a great time! 😊