BxD Primer Series: Transformer Models
Hey there 👋
Welcome to BxD Primer Series where we are covering topics such as Machine learning models, Neural Nets, GPT, Ensemble models, Hyper-automation in ‘one-post-one-topic’ format. Today’s post is on Transformer Models. Let’s get started:
Introduction to Transformer:
The name "transformer" comes from what the key component of its architecture, the self-attention mechanism, does to its input.
This mechanism allows the model to capture relationships between different positions in input sequence, so it can attend to and "transform" the representation of sequence based on context and dependencies.
The term "transformer" was introduced by Google researchers in the 2017 paper "Attention Is All You Need", where they presented a novel architecture for sequence-to-sequence tasks that relied heavily on self-attention mechanisms. The name stuck and has since been widely used to refer to this specific architecture.
We have already covered the basics of Attention Mechanism in previous edition and will focus on elements specific to Transformer architecture in this edition. Please make sure to read previous edition first.
Note: Traditional recurrent neural networks (RNNs), including LSTMs and GRUs, were widely used in sequence-to-sequence tasks before transformers came into the picture. However, they have limitations in capturing long-range dependencies.
The What:
Transformer Model is a type of neural network architecture that consists of two main components: an Encoder and a Decoder. Both components can also work in isolation depending on task at hand.
Encoder is responsible for processing input sequence, a sentence or a document, and converting it into a set of context vectors that capture key information in input. It typically consists of several layers of self-attention and feed-forward neural networks, which allow the model to capture both the content and order of input sequence.
Decoder is responsible for generating output sequence, a translated sentence or a summary of input, based on the context vectors generated by Encoder. It typically consists of several layers of self-attention and feed-forward neural networks, as well as an additional multi-head attention mechanism that allows the model to focus on different parts of input sequence while generating output.
Encoder and Decoder are trained jointly to minimize a loss function by adjusting the weights of neural network using back-propagation. Trained model can be used to generate output sequences for new input sequences.
Transformer Models are known for their ability to handle long input and output sequences, to capture complex relationships between input and output, and to generalize to new input and output sequences.
Key Features of Transformer Models:
Unique features that make transformers well-suited for sequence-to-sequence tasks:
Applications of Transformer Models:
Transformer Models perform extraordinarily well for sequence-to-sequence tasks:
And many more…
Self-Attention Mechanism:
Self-attention computes attention scores between all pairs of tokens in a sequence to capture their dependencies. It assigns weights to different tokens based on their relevance to other tokens in sequence.
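Self-attention mechanism has three components, each computed from the token embeddings: query, key, and value vectors.
Attention scores are calculated by taking the scaled dot product between query and key vectors and applying a softmax function to obtain a distribution over all tokens. The weighted sum of value vectors, using attention scores as weights, yields the final representation of each token:
Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
Where:
Q, K, V are the matrices of query, key, and value vectors of all tokens
d_k is the dimension of key vectors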
Positional Encoding:
Positional encoding is a set of additional embeddings added to input embeddings, to differentiate tokens based on their positions. Positional encodings are combined with token embeddings to create input representation for transformer.
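A commonly used method for positional encoding is sine and cosine functions with different frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Where,
pos is the position of token in sequence
i is the index of the dimension pair
d_model is the embedding dimension
Trigonometric sine and cosine functions with varying frequencies ensure that different positions receive unique encoding patterns. Different frequencies capture different scales or rates of change along the sequence, which helps to distinguish between positions more effectively.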
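To make the sine and cosine scheme concrete, here is a minimal NumPy sketch. It assumes the formulation from the original "Attention Is All You Need" paper (base 10000, sine on even dimensions, cosine on odd dimensions); the function name and the small sizes are illustrative choices, not something defined in this post.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sine/cosine positional encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]        # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[np.newaxis, :]  # values 2i for each dimension pair
    # Each dimension pair gets its own frequency: 1 / 10000^(2i / d_model)
    angles = positions / np.power(10000.0, even_dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)                     # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, : d_model // 2])  # odd dimensions use cosine
    return encoding

# Illustrative sizes (assumptions): 6 positions, embedding dimension 8
pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
```

The resulting matrix is simply added to the token embeddings, so row `pos` of `pe` shifts the embedding of whichever token sits at that position.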
Multi-Head Attention:
Multi-head attention is used to capture different types of dependencies and attend to different parts of input sequence simultaneously. It involves performing self-attention multiple times in parallel, with each head having its own weight matrices for query, key, and value projections. The outputs of all heads are concatenated and linearly transformed to obtain final attention output.
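The mathematical equations for multi-head attention are as follows:
MultiHead(Q, K, V) = Concat(head_1, …, head_h) × W_H
head_i = Attention(Q × W_Q_i, K × W_K_i, V × W_V_i)
Where,
h is the number of attention heads
W_Q_i, W_K_i, W_V_i are the query, key, and value projection matrices of head i
W_H is the learnable matrix that linearly transforms the concatenated heads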
✪ Single-headed attention means there is only one set of attention weights used to compute context vector at each time step and model can only focus on a single aspect of input sequence at a time.
✪ Multi-headed attention uses multiple sets of attention weights to compute context vector. Each attention head can focus on a different aspect of input sequence, which allows the model to capture different types of information at different levels of granularity. For example, some attention heads may focus on global context of input sequence, while others focus on specific local features.
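The difference between one head and several heads may be easier to see in code. The sketch below is a simplified NumPy version, assuming d_model is divisible by the number of heads and that each head's projection is a column slice of one larger matrix (a common implementation trick that is equivalent to separate per-head matrices). The names `multi_head_attention` and `w_out` are illustrative, not from this post.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Softmax over the last axis, shifted by the maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_out, num_heads):
    """Self-attention over x of shape (n, d_model), run in num_heads parallel heads."""
    n, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(matrix):
        # (n, d_model) -> (num_heads, n, d_head): each head gets its own slice of columns
        return matrix.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    query, key, value = (split_heads(x @ w) for w in (w_q, w_k, w_v))
    scores = query @ key.transpose(0, 2, 1) / np.sqrt(d_head)  # (num_heads, n, n)
    weights = softmax(scores)                                  # one attention pattern per head
    heads = weights @ value                                    # (num_heads, n, d_head)
    concatenated = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concatenated @ w_out                                # final linear transformation

# Illustrative sizes and random weights (assumptions, not trained values)
rng = np.random.default_rng(0)
n, d_model = 4, 8
x = rng.normal(size=(n, d_model))
w_q, w_k, w_v, w_out = (rng.normal(size=(d_model, d_model)) for _ in range(4))
output = multi_head_attention(x, w_q, w_k, w_v, w_out, num_heads=2)
```

With `num_heads=1` the function reduces to ordinary single-headed attention followed by the output transformation.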
Attention Mask:
Attention mask controls the flow of information during self-attention mechanism. It determines which positions or tokens in input sequence should be attended to or ignored by model.
Attention mask ensures that the model attends only to valid positions and ignores future or padding tokens. It is a square matrix of shape (N, N), where N is the length of input sequence.
Let's denote this matrix as M. Each element M[i, j] of mask matrix can take two values:
M[i, j] = 1: token i is allowed to attend to token j
M[i, j] = 0: token i must ignore token j
Attention mask is applied during attention scores calculation, before the softmax activation function. Scores corresponding to M[i, j] = 0 are set to a very large negative value (-inf).
After softmax, the attention weights at those positions effectively become zero, indicating that the model should not attend to them.
✪ Padding tokens are sometimes added to ensure that all input sequences have the same length. Encoder masking is used to mask out padding tokens in input sequence.
✪ Decoder masking, on the other hand, is used to mask out future tokens in output sequence.
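As a rough illustration of both kinds of masking, the NumPy sketch below builds a decoder (causal) mask and a padding mask, then applies them by setting masked scores to -inf before softmax. The helper names are hypothetical, and the convention 1 = attend, 0 = ignore is assumed.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Decoder mask: token i may attend to positions j <= i only."""
    return np.tril(np.ones((n, n), dtype=int))

def padding_mask(is_real_token) -> np.ndarray:
    """Encoder mask: every token may attend only to non-padding positions."""
    real = np.asarray(is_real_token, dtype=int)
    return np.tile(real, (len(real), 1))  # (N, N), column j is 1 if token j is real

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Set masked scores to -inf, then apply a row-wise softmax.

    Assumes every row keeps at least one unmasked position.
    """
    masked = np.where(mask == 1, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)  # exp(-inf) is exactly 0
    return exp / exp.sum(axis=-1, keepdims=True)

# Uniform scores make the effect of the mask easy to read
scores = np.zeros((4, 4))
causal_weights = masked_softmax(scores, causal_mask(4))

# Last position is padding; combine both masks with element-wise multiplication
combined = causal_mask(4) * padding_mask([True, True, True, False])
combined_weights = masked_softmax(scores, combined)
```

With uniform scores, each row of `causal_weights` spreads its attention evenly over the current and earlier positions, and the padded last column of `combined_weights` receives zero weight.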
Basic Architecture Diagram:
The How:
Processing of input token by encoder-decoder transformer model (in sequential order):
✪ Input Encoding: Let input sequence be denoted as X = {x1, x2, …, xn}, where each xi represents a token.
Each token is embedded into a continuous vector representation using an embedding matrix E of dimensions (d_model × V), where d_model is the embedding dimension and V is the vocabulary size.
Embedded input sequence is denoted as X_embed = {e1, e2, …, en}.
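The embedding step can be sketched as a simple column lookup. This assumes tokens have already been mapped to integer ids by some tokenizer; the sizes, the random matrix, and the ids below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab_size = 4, 10                 # illustrative sizes (assumptions)
E = rng.normal(size=(d_model, vocab_size))  # embedding matrix of dimensions (d_model × V)

token_ids = np.array([3, 1, 7])             # hypothetical ids for tokens x1, x2, x3

# Embedding lookup: token with id t is represented by column t of E
x_embed = E[:, token_ids].T                 # (n, d_model), one row per token

# Equivalent view: multiply one-hot token vectors by the embedding matrix
one_hot = np.eye(vocab_size)[token_ids]     # (n, V)
x_embed_matmul = one_hot @ E.T              # (n, d_model)
```

Both routes give the same vectors; the lookup is just the cheaper way to compute the one-hot product.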
✪ Encoder consists of L identical layers. Each layer has two sub-layers: self-attention and feed-forward networks, both followed by residual connection and layer normalization.
a) Self-Attention: For each token e_i in input sequence, compute the query, key, and value vectors as follows:
Query = e_i × W_Q
Key = e_i × W_K
Value = e_i × W_V
Where W_Q, W_K, W_V are learnable weight matrices of dimension (d_model × d_k), and d_k is the dimension of key, query, and value vectors.
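The self-attention mechanism calculates attention weights for token e_i by taking the scaled dot product between its query vector and the key vectors of all tokens, followed by a softmax function:
AttentionWeights = softmax(Query × Key^T / √d_k)
The attended output for token e_i is obtained by weighting value vectors of all tokens:
AttendedOutput = AttentionWeights × Value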
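b) Residual Connection and Layer Normalization: The output of self-attention sub-layer is combined with input embedding through a residual connection:
ResidualOutput = e_i + AttendedOutput
Layer normalization is applied to normalize the output:
NormalizedOutput = LayerNorm(ResidualOutput)
c) Feed-Forward Networks: Normalized output is passed through a feed-forward neural network consisting of two linear transformations with a non-linear activation in between (ReLU in the original paper):
FFN(x) = ReLU(x × W_1 + b_1) × W_2 + b_2
Where,
W_1, W_2 are learnable weight matrices
b_1, b_2 are learnable bias terms
d) Residual Connection and Layer Normalization: The output of feed-forward sub-layer is combined with normalized output using a residual connection, followed by layer normalization:
EncoderLayerOutput = LayerNorm(NormalizedOutput + FFN(NormalizedOutput))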
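Steps a) to d) of a single encoder layer can be put together in a short NumPy sketch. It is a simplification under stated assumptions: a single attention head with d_k equal to d_model (so the residual addition has matching shapes), ReLU as the activation, layer normalization without learnable gain and bias, and small random weights instead of trained ones. The parameter dictionary and function names are illustrative.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def encoder_layer(x, p):
    # a) Self-attention: queries, keys, and values all come from the same input
    query, key, value = x @ p["W_Q"], x @ p["W_K"], x @ p["W_V"]
    d_k = query.shape[-1]
    weights = softmax(query @ key.T / np.sqrt(d_k))
    attended = weights @ value
    # b) Residual connection and layer normalization
    normalized = layer_norm(x + attended)
    # c) Feed-forward network: two linear transformations with ReLU in between
    hidden = np.maximum(0.0, normalized @ p["W_1"] + p["b_1"])
    ffn_out = hidden @ p["W_2"] + p["b_2"]
    # d) Residual connection and layer normalization
    return layer_norm(normalized + ffn_out)

# Illustrative sizes and random parameters (assumptions, not trained values)
rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 8, 32
params = {
    "W_Q": rng.normal(scale=0.1, size=(d_model, d_model)),
    "W_K": rng.normal(scale=0.1, size=(d_model, d_model)),
    "W_V": rng.normal(scale=0.1, size=(d_model, d_model)),
    "W_1": rng.normal(scale=0.1, size=(d_model, d_ff)),
    "b_1": np.zeros(d_ff),
    "W_2": rng.normal(scale=0.1, size=(d_ff, d_model)),
    "b_2": np.zeros(d_model),
}
x_embed = rng.normal(size=(n, d_model))
encoded = encoder_layer(x_embed, params)
```

A full encoder would stack L such layers, feeding the output of one layer as the input of the next.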
✪ Decoder also consists of L identical layers, but with more sub-layers compared to encoder.
a) Self-Attention: Similar to encoder self-attention, key, query, and value vectors are computed for each token in the decoder:
Query = DecoderEmbedding × W_Q
Key = DecoderEmbedding × W_K
Value = DecoderEmbedding × W_V
Attention weights and attended output are calculated as before.
b) Encoder-Decoder Attention: The decoder attends to encoder output using encoder-decoder attention mechanism. Query vectors are computed from the decoder itself, while key and value vectors are computed from encoder output:
Query = DecoderState × W_Q
Key = EncoderOutput × W_K
Value = EncoderOutput × W_V
Here DecoderState is the output of the decoder's self-attention sub-layer.
Attention weights and attended output are calculated as before.
c) Residual Connection and Layer Normalization are applied to outputs of self-attention and encoder-decoder attention sub-layers, similar to the encoder.
d) Feed-Forward Networks: Normalized output is passed through a feed-forward neural network, similar to the encoder.
e) Residual Connection and Layer Normalization are applied to the output of feed-forward network, similar to the encoder.
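The decoder sub-layers can be sketched the same way. The NumPy code below follows the formulation of the original Transformer paper, in which the decoder self-attention is causally masked and, in encoder-decoder attention, queries come from the decoder while keys and values come from the encoder output. It assumes a single head with d_k equal to d_model, ReLU in the feed-forward network, and random weights; all names are illustrative.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def attention(query_input, memory, w_q, w_k, w_v, mask=None):
    """Queries come from query_input; keys and values come from memory."""
    query, key, value = query_input @ w_q, memory @ w_k, memory @ w_v
    scores = query @ key.T / np.sqrt(query.shape[-1])
    if mask is not None:
        scores = np.where(mask == 1, scores, -np.inf)  # hide masked positions
    return softmax(scores) @ value

def decoder_layer(y, encoder_output, p):
    m = y.shape[0]
    causal = np.tril(np.ones((m, m), dtype=int))       # no access to future tokens
    # a) Masked self-attention over decoder tokens, then residual + layer norm
    state = layer_norm(y + attention(y, y, *p["self"], mask=causal))
    # b) Encoder-decoder attention: query from decoder, key/value from encoder
    state = layer_norm(state + attention(state, encoder_output, *p["cross"]))
    # d) Feed-forward network, e) residual + layer norm
    w1, b1, w2, b2 = p["ffn"]
    ffn_out = np.maximum(0.0, state @ w1 + b1) @ w2 + b2
    return layer_norm(state + ffn_out)

# Illustrative sizes and random parameters (assumptions, not trained values)
rng = np.random.default_rng(0)
n, m, d_model, d_ff = 6, 4, 8, 32

def random_matrices(count, shape):
    return tuple(rng.normal(scale=0.1, size=shape) for _ in range(count))

params = {
    "self": random_matrices(3, (d_model, d_model)),
    "cross": random_matrices(3, (d_model, d_model)),
    "ffn": (rng.normal(scale=0.1, size=(d_model, d_ff)), np.zeros(d_ff),
            rng.normal(scale=0.1, size=(d_ff, d_model)), np.zeros(d_model)),
}
encoder_output = rng.normal(size=(n, d_model))  # stand-in for the encoder's result
decoder_input = rng.normal(size=(m, d_model))   # embedded output tokens so far
decoded = decoder_layer(decoder_input, encoder_output, params)
```

Because of the causal mask, changing a later decoder token leaves the outputs for earlier positions unchanged, which is what makes auto-regressive generation possible.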
✪ Output Projection: Final output of decoder is projected to the desired output dimension using a linear projection layer:
Output = DecoderOutput × W_O + b_O
where W_O and b_O are the learnable weight matrix and bias term.
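As a small sketch of this projection, the NumPy snippet below assumes the desired output dimension is the vocabulary size, so that a softmax over the projected values gives a probability for every token. The sizes and random values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_model, vocab_size = 3, 8, 10               # illustrative sizes (assumptions)
decoder_output = rng.normal(size=(m, d_model))  # stand-in for the final decoder layer output
W_O = rng.normal(size=(d_model, vocab_size))    # learnable weight matrix (random here)
b_O = np.zeros(vocab_size)                      # learnable bias term

# Linear projection: one score (logit) per vocabulary entry for each position
logits = decoder_output @ W_O + b_O

# Softmax turns the logits of each position into a probability distribution
shifted = logits - logits.max(axis=-1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

predicted_ids = probs.argmax(axis=-1)           # most likely token id at each position
```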
✪ Training and Inference: The model is trained to minimize a defined loss function, such as cross-entropy loss, using back-propagation and gradient descent algorithms. Model parameters are updated iteratively.
During inference or generation, the model predicts output tokens auto-regressively.
Generated tokens are fed back as input to predict subsequent tokens until a special end-of-sequence token is generated or a predefined maximum sequence length is reached.
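The auto-regressive loop can be illustrated with a toy example. In the sketch below, `next_token_logits` is a hypothetical stand-in for a trained transformer: it simply looks up hand-written scores based on the last token. The vocabulary and scores are invented, and greedy selection (always taking the highest-scoring token) is used for simplicity.

```python
import numpy as np

VOCAB = ["<bos>", "the", "cat", "sat", "<eos>"]
BOS, EOS = 0, 4

# Hypothetical stand-in for a trained model: row t holds next-token scores after token t
TOY_LOGITS = np.array([
    [-9.0, 3.0, 1.0, 0.0, -9.0],  # after <bos>
    [-9.0, 0.0, 3.0, 1.0, -9.0],  # after "the"
    [-9.0, 0.0, 0.0, 3.0, 1.0],   # after "cat"
    [-9.0, 0.0, 0.0, 0.0, 3.0],   # after "sat"
    [-9.0, 0.0, 0.0, 0.0, 3.0],   # after <eos> (never reached)
])

def next_token_logits(tokens):
    """A real model would attend over all tokens; this toy only looks at the last one."""
    return TOY_LOGITS[tokens[-1]]

def greedy_decode(max_len=10):
    tokens = [BOS]
    while len(tokens) < max_len:           # stop at the predefined maximum length
        next_id = int(np.argmax(next_token_logits(tokens)))
        tokens.append(next_id)             # feed the generated token back as input
        if next_id == EOS:                 # stop at the end-of-sequence token
            break
    return tokens

decoded = [VOCAB[t] for t in greedy_decode()]
```

Here `decoded` ends with the end-of-sequence token; with a smaller `max_len` the loop stops early at the length limit instead.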
Note 1: Temperature scaling is a simple technique to control diversity of transformer output. During generation, the model outputs a probability distribution over vocabulary. The model's raw scores (logits) are divided by a temperature parameter T before softmax, which controls how sharp or flat that distribution is, and hence the diversity of generated text. A higher temperature (T > 1) flattens the distribution and leads to more diverse and creative text, while a lower temperature (T < 1) sharpens it and leads to more conservative and repetitive text.
Note 2: Residual connections, also known as skip connections, address the issue of vanishing gradients and allow information and gradients to propagate more effectively during training.
Note 3: Layer normalization is used to normalize outputs of each sub-layer, improving stability and convergence.
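The temperature scaling from Note 1 can be sketched in a few lines of NumPy. The usual formulation is assumed here: logits are divided by the temperature before softmax. The logits below are invented for illustration.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # shift for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

logits = [2.0, 1.0, 0.1]    # made-up scores for a 3-token vocabulary
cold = softmax_with_temperature(logits, temperature=0.5)  # sharper distribution
base = softmax_with_temperature(logits, temperature=1.0)
warm = softmax_with_temperature(logits, temperature=2.0)  # flatter distribution

# Sample the next token id from the flatter distribution
rng = np.random.default_rng(0)
sampled_id = int(rng.choice(len(warm), p=warm))
```

The top token gets the largest share of probability at low temperature and the smallest share at high temperature, which is why sampling at high temperature produces more varied text.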
Beam Search:
Beam search is a decoding algorithm used in sequence generation tasks. It helps to select most probable output sequence given input sequence and trained transformer model. Instead of naively selecting the highest probability token at each step, beam search maintains a beam of top-k candidates and explores multiple possible paths to find most likely sequence.
Beam search algorithm works this way:
➤ Initialization: Set the beam width (k), which determines number of candidate sequences to consider at each decoding step. Initialize the beam with a start token or input sequence.
➤ Decoding Steps: Perform a series of steps until termination condition is met (reaching maximum sequence length or encountering an end token).
a. Generate candidates: For each candidate sequence in beam, predict the probabilities of next possible tokens using transformer model. Use softmax function for token probabilities.
b. Update the beam: For each candidate, multiply its probability by the probability of its parent sequence (cumulative probability) to score the sequence so far. Keep track of top-k sequences with highest cumulative probabilities.
c. Early stopping: If any of top-k sequences end with an end token, it is considered complete sequence. You can also enforce a minimum length criterion to avoid overly short sequences. If the number of complete sequences reaches k, terminate the decoding process.
➤ Final Selection: Once termination condition is met, select the sequence with highest cumulative probability from the set of complete sequences as final output.
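Beam search provides a trade-off between the quality of generated sequences and computation cost. By adjusting the beam width (k), you can control level of exploration versus exploitation: k = 1 reduces to greedy decoding, while a larger k explores more candidate sequences at a higher cost.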
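The steps above can be sketched with a toy model. In the code below, `TOY_PROBS` is a hypothetical table of next-token probabilities that depends only on the last token; a real transformer would condition on the whole sequence. Log-probabilities are summed instead of multiplying probabilities, which gives the same ranking and avoids numerical underflow on long sequences. No minimum-length criterion is applied.

```python
import math

# Hypothetical next-token probabilities, keyed by the last generated token
TOY_PROBS = {
    "<bos>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"x": 0.9, "y": 0.1},
    "x": {"<eos>": 1.0},
    "y": {"<eos>": 1.0},
}

def beam_search(beam_width, max_len=5):
    beam = [(0.0, ["<bos>"])]  # (cumulative log-probability, tokens)
    complete = []
    for _ in range(max_len):
        # a. Generate candidates: extend every sequence in the beam by one token
        candidates = []
        for score, tokens in beam:
            for token, prob in TOY_PROBS[tokens[-1]].items():
                candidates.append((score + math.log(prob), tokens + [token]))
        # b. Update the beam: keep the top-k candidates by cumulative score
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for score, tokens in candidates[:beam_width]:
            if tokens[-1] == "<eos>":
                complete.append((score, tokens))  # c. sequence is complete
            else:
                beam.append((score, tokens))
        if len(complete) >= beam_width or not beam:
            break
    # Final selection: highest cumulative probability among complete sequences
    best_score, best_tokens = max(complete or beam, key=lambda c: c[0])
    return best_tokens, math.exp(best_score)

greedy_tokens, greedy_prob = beam_search(beam_width=1)
beam_tokens, beam_prob = beam_search(beam_width=2)
```

In this toy table, greedy decoding (k = 1) commits to "a" because it is the most likely first token, while k = 2 keeps "b" alive long enough to find the sequence through "b" and "x", which has a higher overall probability.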
Pre-Training in Transformer Models:
Pre-training is unsupervised (more precisely, self-supervised) training of transformer model on a large, diverse, unlabeled dataset, such as a collection of web pages or a corpus of books. During pre-training, the model learns to recognize patterns in input text and encode that information into context vectors.
After pre-training, the model can be fine-tuned on a smaller dataset for a specific task. During fine-tuning, model weights are adjusted to optimize performance on new task, while preserving the general language understanding learned during pre-training.
✪ Two main techniques of pre-training:
✪ Examples of pre-trained transformer models:
The Why:
Reasons to use Encoder-Decoder Transformer Models:
The Why Not:
Reasons to not use Transformer Models:
Time for you to support:
In next edition, we will cover Transfer Learning Techniques.
Let us know your feedback!
Until then,
Have a great time! 😊