BxD Primer Series: Gated Recurrent Unit (GRU) Neural Networks


Hey there 👋

Welcome to BxD Primer Series where we are covering topics such as Machine learning models, Neural Nets, GPT, Ensemble models, Hyper-automation in ‘one-post-one-topic’ format. Today’s post is on Gated Recurrent Unit (GRU) Neural Networks. Let’s get started:

The What:

GRU is a gated type of recurrent neural network (RNN), closely related to the Long Short-Term Memory (LSTM) network, which is also widely used for sequence modeling tasks. Check our editions on RNN here and LSTM here.

Like LSTM, GRU is designed to address the vanishing gradient problem that can occur when training RNNs. Vanishing gradient is a problem where gradients become very small as they are propagated back through the network during training, which makes it difficult to learn long-term dependencies in data.

GRU addresses this problem by using a gating mechanism that allows it to selectively update and reset its hidden state. This gating mechanism is simpler than the one used in LSTM, which makes GRU easier to train and faster to compute.

The core idea of the GRU is to use two gates: an update gate and a reset gate. These gates are used to control the flow of information through the network, allowing it to selectively remember or forget previous information.

  • Update gate determines how much of the previous hidden state to keep and how much of the new candidate state to use.
  • Reset gate determines how much of the previous hidden state to forget when computing the new candidate state.

The update gate and reset gate are both sigmoid functions that take as input the concatenation of the previous hidden state and the current input. Their outputs are then used to update the hidden state of the network.
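
For intuition, here is a minimal NumPy sketch of how the two gates could be computed at a single time step. The dimensions, weight initialization, and variable names are assumptions made for illustration, not something specified in this article:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Toy dimensions and random weights, purely for illustration
input_size, hidden_size = 4, 3
rng = np.random.default_rng(0)

x_t = rng.normal(size=input_size)      # current input
h_prev = np.zeros(hidden_size)         # previous hidden state

# One weight matrix and bias per gate, acting on the concatenation [h_{t-1}, x_t]
W_z = rng.normal(size=(hidden_size, hidden_size + input_size))
W_r = rng.normal(size=(hidden_size, hidden_size + input_size))
b_z = np.zeros(hidden_size)
b_r = np.zeros(hidden_size)

concat = np.concatenate([h_prev, x_t])
z_t = sigmoid(W_z @ concat + b_z)      # update gate, each entry in (0, 1)
r_t = sigmoid(W_r @ concat + b_r)      # reset gate, each entry in (0, 1)
print(z_t, r_t)
```

Because the sigmoid squashes each gate value into (0, 1), the gates act as soft switches that scale how much information passes through.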

Basic Architecture:

GRU has become a popular choice for sequence modeling tasks such as language modeling, machine translation, and speech recognition due to its simplicity, efficiency, and effectiveness.

Here is a ‘time-unfolded’ version of a GRU network:

[Figure: time-unfolded GRU network]

In above diagram:

  • t is the current time step
  • x(t) is the input at the current time step
  • h(t) is the output of the previous time step
  • r(t) is the reset gate
  • z(t) is the update gate
  • h(t+1) is the new output at the current time step

Difference between GRU and LSTM:

GRU (Gated Recurrent Unit) and LSTM (Long Short-Term Memory) are both popular types of recurrent neural networks (RNNs) that are designed to model sequential data. They are both capable of processing variable-length sequences of input data and maintaining information over long time intervals.

The primary difference between GRU and LSTM lies in their respective architectures. While both networks have gates that control the flow of information, the gating mechanisms in GRU are simpler and more streamlined than those in LSTM.

GRU uses only two gates: an update gate and a reset gate.

  • Update gate controls how much information from the previous hidden state flows into the current hidden state.
  • Reset gate controls how much of the past hidden state to forget. The candidate hidden state is computed from the current input and the reset-gated previous hidden state, and the updated hidden state is obtained by combining the candidate hidden state and the previous hidden state using the update gate.

LSTM uses three gates: an input gate, an output gate, and a forget gate.

  • Input gate controls the flow of new information into the cell
  • Forget gate controls how much information from the previous cell state is retained
  • Output gate controls the flow of information from the current cell state to the next hidden state. The cell state is updated using the input and forget gates, and the hidden state is obtained by combining the current cell state with the output gate.

The How:

GRUs can be broken down into several equations that are computed at each time step t.

Let x_t be the input at time step t, and h_{t-1} be the output of previous time step.

Reset gate r_t and update gate z_t are computed as follows:

r_t = 𝜎(W_r · [h_{t-1}, x_t] + b_r)
z_t = 𝜎(W_z · [h_{t-1}, x_t] + b_z)

Where,

  • 𝜎 is the sigmoid function
  • W_r, W_z, b_r, and b_z are learnable parameters
  • [h_{t-1}, x_t] denotes the concatenation of h_{t-1} and x_t

Next step is to compute the candidate activation vector ~h_t:

~h_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)

Where,

  • W_h and b_h are learnable parameters
  • r_t ⊙ h_{t-1} denotes the element-wise product (also called Hadamard product) of r_t and h_{t-1}

Final output h_t is then computed as a weighted average of previous output h_{t-1} and candidate activation vector ~h_t, using the update gate z_t to determine the weights:

h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ ~h_t

This equation allows the GRU to selectively update and forget information over time, based on the current input and previous output.
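
Putting the equations together, here is a minimal NumPy sketch of a single GRU step, following the concatenation-based formulation above. The toy sizes, random weights, and names such as gru_step are illustrative assumptions only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, b_r, W_z, b_z, W_h, b_h):
    """One GRU time step, following the equations above."""
    concat = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat + b_r)               # reset gate
    z_t = sigmoid(W_z @ concat + b_z)               # update gate
    concat_reset = np.concatenate([r_t * h_prev, x_t])
    h_tilde = np.tanh(W_h @ concat_reset + b_h)     # candidate activation vector
    return (1.0 - z_t) * h_prev + z_t * h_tilde     # weighted average via z_t

# Toy sizes and random weights, for illustration only
rng = np.random.default_rng(1)
input_size, hidden_size = 4, 3
shape = (hidden_size, hidden_size + input_size)
W_r, W_z, W_h = (rng.normal(scale=0.1, size=shape) for _ in range(3))
b_r, b_z, b_h = (np.zeros(hidden_size) for _ in range(3))

h = np.zeros(hidden_size)
for x in rng.normal(size=(5, input_size)):          # a 5-step input sequence
    h = gru_step(x, h, W_r, b_r, W_z, b_z, W_h, b_h)
print(h)
```

The loop at the end shows the recurrence: the hidden state produced at one step is fed back in as h_prev at the next.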

Note 1: Candidate hidden state is a temporary memory that stores the information gathered from current input and previous hidden state.

Difference between hidden state and output:

  • Hidden state represents the memory of the network and is updated at each time step based on the input and the previous hidden state. It contains information about the input sequence processed so far.
  • Output, on the other hand, is the prediction or classification of the network and is computed based on the updated hidden state. It is a function of the hidden state and can be used for tasks such as predicting the next word in a sentence or classifying a sequence into a category.
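
As a sketch of this distinction, the snippet below (PyTorch, with toy sizes assumed only for illustration) produces hidden states with nn.GRU and turns the final hidden state into an output with a separate linear layer:

```python
import torch
import torch.nn as nn

# Toy sizes, assumed for illustration: 10-dim inputs, 16-dim hidden state, 5 classes
gru = nn.GRU(input_size=10, hidden_size=16, batch_first=True)
classifier = nn.Linear(16, 5)              # maps a hidden state to class scores

x = torch.randn(2, 7, 10)                  # batch of 2 sequences, 7 time steps each
hidden_states, h_last = gru(x)             # hidden state at every step, plus the final one

logits = classifier(h_last.squeeze(0))     # the "output" is a function of the hidden state
print(hidden_states.shape, logits.shape)   # torch.Size([2, 7, 16]) torch.Size([2, 5])
```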

Difference between unidirectional and bidirectional GRU:

A unidirectional GRU processes the input sequence in the forward direction only, whereas a bidirectional GRU processes the input sequence in both the forward and backward directions.

In a unidirectional GRU, the output at each time step is based only on information from previous time steps. This makes the unidirectional GRU suitable for tasks where the context flows in one direction only.

In contrast, a bidirectional GRU processes the input sequence in both directions, allowing it to capture information from both past and future context. This makes a bidirectional GRU more suitable for tasks where the context flows in both directions, such as machine translation, where the meaning of a word in a sentence may depend on words that come both before and after it.

The output of a bidirectional GRU is typically computed by concatenating the hidden states of the forward and backward GRUs at each time step.
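
The effect of that concatenation is easy to see in a short PyTorch sketch (toy sizes assumed for illustration): the bidirectional output dimension is twice the hidden size because forward and backward states are concatenated at each step.

```python
import torch
import torch.nn as nn

# Toy sizes, assumed for illustration
uni = nn.GRU(input_size=10, hidden_size=16, batch_first=True)
bi = nn.GRU(input_size=10, hidden_size=16, batch_first=True, bidirectional=True)

x = torch.randn(2, 7, 10)   # batch of 2 sequences, 7 time steps each

out_uni, _ = uni(x)
out_bi, _ = bi(x)

# The bidirectional output concatenates forward and backward states at each step
print(out_uni.shape)        # torch.Size([2, 7, 16])
print(out_bi.shape)         # torch.Size([2, 7, 32])
```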

Relationship between input sequence size and GRU layer size:

Input sequence size refers to the number of time steps in a sequence of input data that is fed to GRU layer. GRU layer size refers to the number of GRU units or cells in a single GRU layer.

If the input sequence size is too large relative to the GRU layer size, the model may have difficulty learning long-term dependencies in the data. This is because the hidden state must summarize information from every time step processed so far, and if there are too many time steps relative to its capacity, the model may not be able to capture the relevant patterns in the data.

On the other hand, if the input sequence size is too small relative to the GRU layer size, the model can overfit the data, as there may not be enough information in the input to support the number of parameters in the model. This leads to poor generalization performance on unseen data.

It is also important to note that increasing the number of GRU layers can compensate for a smaller GRU layer size, as each additional layer can learn more complex patterns in the data.
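
In a framework such as PyTorch, the "GRU layer size" corresponds to hidden_size and stacking corresponds to num_layers, while the input sequence length is independent of both. A minimal sketch with assumed example sizes:

```python
import torch
import torch.nn as nn

# "GRU layer size" maps to hidden_size; stacking maps to num_layers (assumed example values)
gru = nn.GRU(input_size=10, hidden_size=64, num_layers=2, batch_first=True)

short_seq = torch.randn(1, 20, 10)    # 20 time steps
long_seq = torch.randn(1, 500, 10)    # 500 time steps share the same 64-dim hidden state

for seq in (short_seq, long_seq):
    out, h_n = gru(seq)
    print(seq.shape[1], out.shape, h_n.shape)   # hidden size stays 64 regardless of length
```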

The Why:

Reasons for using GRUs:

  1. Long-term dependencies: GRUs are able to capture dependencies between inputs that are far apart in time.
  2. Fewer parameters: Compared to other recurrent neural network architectures such as LSTMs, GRUs have fewer parameters, which makes them faster to train and easier to use in practice.
  3. Less prone to vanishing gradients: The gating mechanism in GRUs allows them to maintain gradients over longer periods of time, which makes them more stable during training.
  4. Lower computational cost: GRUs have a simpler architecture compared to LSTMs and other types of recurrent neural networks, which makes them computationally less expensive and faster to train.
  5. Suitable for online learning: GRUs can be trained incrementally in an online fashion, which makes them suitable for applications where the data arrives in a streaming fashion, such as speech recognition or video analysis.

The Why Not:

Reasons for not using GRUs:

  1. Limited control over memory: While GRUs are designed to capture long-term dependencies, they have limited control over the contents of their memory and cannot store information as precisely as LSTMs.
  2. Limited expressiveness: The simpler architecture of GRUs, compared to LSTMs and other types of recurrent neural networks, limits their expressiveness and ability to capture complex patterns in the input data.
  3. Limited parallelization: The sequential nature of recurrent neural networks, including GRUs, makes it difficult to parallelize their computations, which limits their scalability and makes them less suitable for very large datasets.

Time for you to support:

  1. Reply to this email with your question
  2. Forward/Share to a friend who can benefit from this
  3. Chat on Substack with BxD (here)
  4. Engage with BxD on LinkedIn (here)

In next edition, we will cover Autoencoder Neural Networks.

Let us know your feedback!

Until then,

Have a great time! 😊

#businessxdata #bxd #GRU #neuralnetworks #primer
