BxD Primer Series: Gated Recurrent Unit (GRU) Neural Networks
Hey there 👋
Welcome to BxD Primer Series where we are covering topics such as Machine learning models, Neural Nets, GPT, Ensemble models, Hyper-automation in ‘one-post-one-topic’ format. Today’s post is on Gated Recurrent Unit (GRU) Neural Networks. Let’s get started:
The What:
GRU is a simplified variant of the Long Short-Term Memory (LSTM) network; both are types of recurrent neural networks (RNNs) widely used for sequence modeling tasks. Check our editions on RNN here and LSTM here.
Like LSTM, GRU is designed to address the vanishing gradient problem that can occur when training RNNs. Vanishing gradient is a problem where gradients become very small as they are propagated back through the network during training, which makes it difficult to learn long-term dependencies in data.
GRU addresses this problem by using a gating mechanism that allows it to selectively update and reset its hidden state. This gating mechanism is simpler than the one used in LSTM, which makes GRU easier to train and faster to compute.
The core idea of the GRU is to use two gates: an update gate and a reset gate. These gates are used to control the flow of information through the network, allowing it to selectively remember or forget previous information.
The update gate and reset gate are both sigmoid functions that take as input the concatenation of the previous hidden state and the current input. The outputs of these gates are then used to update the hidden state of the network.
Basic Architecture:
GRU has become a popular choice for sequence modeling tasks such as language modeling, machine translation, and speech recognition due to its simplicity, efficiency, and effectiveness.
Here is a ‘time unfolded’ view of the GRU network (original diagram not reproduced here): the same GRU cell is applied at every time step, taking the current input x_t and the previous hidden state h_{t-1}, and producing the new hidden state h_t, which is passed on to the next time step.
Difference between GRU and LSTM:
GRU (Gated Recurrent Unit) and LSTM (Long Short-Term Memory) are both popular types of recurrent neural networks (RNNs) that are designed to model sequential data. They are both capable of processing variable-length sequences of input data and maintaining information over long time intervals.
The primary difference between GRU and LSTM lies in their respective architectures. While both networks have gates that control the flow of information, the gating mechanisms in GRU are simpler and more streamlined than those in LSTM.
GRU uses only two gates: an update gate and a reset gate.
LSTM uses three gates: an input gate, an output gate, and a forget gate.
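One practical consequence of the simpler gating is a smaller parameter count. Below is a minimal sketch (assuming PyTorch is available; the input and hidden sizes are arbitrary choices for illustration) that compares the number of trainable parameters in a single-layer GRU and LSTM of the same dimensions:

```python
import torch.nn as nn

# Same input and hidden sizes for both layers (arbitrary values for illustration)
input_size, hidden_size = 64, 128

gru = nn.GRU(input_size, hidden_size, batch_first=True)
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)

def num_params(module):
    # Total number of trainable parameters in the module
    return sum(p.numel() for p in module.parameters())

# GRU has 3 weight/bias blocks (reset, update, candidate) vs. 4 in LSTM,
# so it ends up with roughly 25% fewer parameters at the same sizes.
print("GRU parameters: ", num_params(gru))
print("LSTM parameters:", num_params(lstm))
```

Fewer parameters generally mean faster training and less risk of overfitting on smaller datasets, which is a big part of GRU's appeal.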
The How:
GRUs can be broken down into several equations that are computed at each time step t.
Let x_t be the input at time step t, and h_{t-1} be the output (hidden state) from the previous time step.
The reset gate r_t and update gate z_t are computed as follows:

r_t = σ(W_r · [h_{t-1}, x_t] + b_r)
z_t = σ(W_z · [h_{t-1}, x_t] + b_z)

Where,
σ is the sigmoid function, [h_{t-1}, x_t] denotes the concatenation of the previous hidden state and the current input, W_r and W_z are weight matrices, and b_r and b_z are bias vectors learned during training.
The next step is to compute the candidate activation vector ~h_t:

~h_t = tanh(W_h · [r_t ⊙ h_{t-1}, x_t] + b_h)
Where,
tanh is the hyperbolic tangent activation, ⊙ denotes element-wise multiplication, r_t ⊙ h_{t-1} is the previous hidden state scaled by the reset gate (so the network can ‘forget’ parts of it), and W_h and b_h are the learned weight matrix and bias vector.
The final output h_t is then computed as a weighted average of the previous output h_{t-1} and the candidate activation vector ~h_t, using the update gate z_t to determine the weights:

h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ ~h_t

(Some formulations swap the roles of z_t and 1 - z_t; the idea is the same.)
This equation allows the GRU to selectively update and forget information over time, based on the current input and previous output.
Note 1: The candidate hidden state is a temporary memory that stores the information gathered from the current input and the previous hidden state.
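To make the equations concrete, here is a minimal single-step GRU cell in NumPy. This is only a sketch following the equations above; the weight shapes, the concatenation layout, and the random initialization are illustrative assumptions, not a reference implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W_h, b_r, b_z, b_h):
    """One GRU time step following the equations above.

    x_t:    input vector at time t, shape (input_size,)
    h_prev: previous hidden state, shape (hidden_size,)
    W_*:    weight matrices of shape (hidden_size, hidden_size + input_size)
    b_*:    bias vectors of shape (hidden_size,)
    """
    concat = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    r_t = sigmoid(W_r @ concat + b_r)                 # reset gate
    z_t = sigmoid(W_z @ concat + b_z)                 # update gate
    concat_reset = np.concatenate([r_t * h_prev, x_t])
    h_cand = np.tanh(W_h @ concat_reset + b_h)        # candidate activation ~h_t
    h_t = (1 - z_t) * h_prev + z_t * h_cand           # weighted average via z_t
    return h_t

# Tiny usage example with random weights (illustrative only)
rng = np.random.default_rng(0)
input_size, hidden_size = 4, 3
shape = (hidden_size, hidden_size + input_size)
W_r, W_z, W_h = (rng.standard_normal(shape) * 0.1 for _ in range(3))
b_r = b_z = b_h = np.zeros(hidden_size)
h = np.zeros(hidden_size)
for x in rng.standard_normal((5, input_size)):        # a sequence of 5 time steps
    h = gru_step(x, h, W_r, W_z, W_h, b_r, b_z, b_h)
print(h)
```

In practice you would use a framework's built-in GRU layer rather than hand-rolled weights, but walking through one step like this makes the role of each gate easy to see.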
Difference between hidden state and output:
In a GRU, the hidden state and the output at each time step are the same vector: h_t is both emitted as the output for that step and passed forward as the memory for the next step. This is unlike LSTM, which maintains a separate cell state in addition to the hidden state it outputs.
Difference between unidirectional and bidirectional GRU:
A unidirectional GRU processes the input sequence in the forward direction only, whereas a bidirectional GRU processes it in both forward and backward directions.
In a unidirectional GRU, the output at each time step is based only on information from previous time steps. This makes it suitable for tasks where context flows in one direction, such as real-time or streaming prediction.
In contrast, a bidirectional GRU captures information from both past and future context. This makes it more suitable for tasks where the context flows in both directions, such as machine translation, where the meaning of a word in a sentence may depend on words that come both before and after it.
The output of a bidirectional GRU is typically computed by concatenating the hidden states of the forward and backward GRUs at each time step.
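As a quick illustration of that concatenation, the sketch below (assuming PyTorch; the batch, sequence, and feature sizes are arbitrary) shows that a bidirectional GRU's output feature dimension is twice the hidden size:

```python
import torch
import torch.nn as nn

hidden_size = 16
gru = nn.GRU(input_size=8, hidden_size=hidden_size,
             batch_first=True, bidirectional=True)

x = torch.randn(2, 10, 8)   # (batch, time steps, features)
output, h_n = gru(x)

# Forward and backward hidden states are concatenated at every time step
print(output.shape)         # torch.Size([2, 10, 32]) -> 2 * hidden_size
print(h_n.shape)            # torch.Size([2, 2, 16])  -> (num_directions, batch, hidden_size)
```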
Relationship between input sequence size and GRU layer size:
Input sequence size refers to the number of time steps in a sequence of input data fed to the GRU layer. GRU layer size refers to the number of GRU units in a single layer, i.e. the dimensionality of its hidden state.
If the input sequences are very long relative to the GRU layer size, the model may have difficulty learning long-term dependencies. The fixed-size hidden state has to compress information from many time steps, and if its capacity is too small it may not retain the relevant patterns.
On the other hand, if the input sequences are short relative to a large GRU layer size, the model can overfit, as there may not be enough information in the input to support the number of parameters in the model. This leads to poor generalization on unseen data.
It is also worth noting that stacking multiple GRU layers can compensate for a smaller layer size, as each additional layer can learn more complex patterns in the data.
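For example, here is a small sketch (assuming PyTorch; all sizes are arbitrary) of a stacked GRU, where two moderately sized layers are used instead of one very wide layer:

```python
import torch
import torch.nn as nn

# Two stacked GRU layers with a modest hidden size, instead of one very wide layer
gru = nn.GRU(input_size=32, hidden_size=64, num_layers=2,
             batch_first=True, dropout=0.2)

x = torch.randn(8, 100, 32)  # batch of 8 sequences, 100 time steps each
output, h_n = gru(x)
print(output.shape)          # torch.Size([8, 100, 64]) -> outputs of the top layer
print(h_n.shape)             # torch.Size([2, 8, 64])   -> final hidden state of each layer
```

The lower layer captures short-range structure while the upper layer can combine it into longer-range patterns, which is often a better use of parameters than simply widening a single layer.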
The Why:
Reasons for using GRUs:
They use a simpler gating mechanism than LSTMs, so they have fewer parameters, train faster, and are easier to tune.
They mitigate the vanishing gradient problem and can capture long-term dependencies in sequential data.
They perform well on sequence modeling tasks such as language modeling, machine translation, and speech recognition, especially when data or compute is limited.
The Why Not:
Reasons for not using GRUs:
With only two gates, a GRU can be somewhat less expressive than an LSTM on tasks that need very fine-grained control over what is remembered and forgotten.
Like all recurrent networks, GRUs process sequences step by step, which limits parallelization on very long sequences.
If your task does not involve sequential structure, a recurrent architecture adds complexity without benefit.
Time for you to support:
If you found this post useful, share it with your network and subscribe so you don't miss upcoming editions.
In the next edition, we will cover Autoencoder Neural Networks.
Let us know your feedback!
Until then,
Have a great time! 😊