Demystifying Tokenization: Preparing Data for Large Language Models (LLMs)

Unlocking the Code: How Tokenization Powers NLP

In the evolving realm of Natural Language Processing (NLP), the term 'tokenization' often surfaces as a cornerstone concept. But what does tokenization truly entail, and how does it prep the vast landscape of human language for the analytical eyes of Large Language Models (LLMs)?

Note: This article is part of the following Medium article:
Note: We will be using Google Colaboratory Python notebooks to avoid setup and environment delays. The focus of this article is to get you up and running in Machine Learning with Python, and we can do everything we need there. The following article explains how to use it:


The Tokenization Process

Imagine a bridge connecting human language and machine understanding; tokenization is the very construction of this bridge. It's a process that breaks down text into bite-sized pieces—tokens—that a machine can digest. These tokens are the atoms of our language universe when it comes to machine learning: each word, subword, or even character can serve as a token.

But how do we translate Shakespearean prose or everyday banter into a numerical sequence that a computer can understand? This is where tokenization shines, encoding words into integers and birthing a language that algorithms can interpret.

Tokenization is a critical first step in preparing data for Large Language Models (LLMs) because these models don't understand raw text; they process numerical data. The tokenizer's role is to convert text into numbers that the model can understand.

Here's a simplified step-by-step explanation:

  1. Input Text: You start with the text you want the model to process, e.g., "Fine Tuning is fun for all!"
  2. Tokenization: The tokenizer breaks down the text into smaller pieces called tokens. These tokens can be words, parts of words, or even characters, depending on the tokenizer's design. For example:
     • Word Tokenization: ["Fine", "Tuning", "is", "fun", "for", "all", "!"]
     • Subword Tokenization: ["Fine", "Tun", "##ing", "is", "fun", "for", "all", "!"] The "##" indicates that "ing" continues the previous token rather than starting a new word.
  3. Token IDs: Each token is then mapped to a unique ID based on a vocabulary file that the tokenizer uses. This file contains a list of all tokens known to the model, each with a unique ID. For instance (illustrative values):
     • "Fine" → 34389
     • "Tun" → 13932
     • "##ing" → 278
     • "is" → 318
     • "fun" → 1257
     • "for" → 329
     • "all" → 477
     • "!" → 0
  4. Encoding: The process of converting tokens into token IDs is called encoding. The sentence "Fine Tuning is fun for all!" might be encoded as [34389, 13932, 278, 318, 1257, 329, 477, 0].
  5. Model Processing: These token IDs are fed into the LLM, which processes them to perform tasks like translation, question-answering, or text generation.
  6. Decoding: If the model outputs token IDs (like when generating text), these IDs need to be converted back into human-readable text. This process is called decoding.

In summary, tokenization is about converting text to numbers and back, aligning the process with the LLM's understanding. It's crucial to use the correct tokenizer for the model to ensure that the text is tokenized in the way the model has been trained to understand.
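The six steps above can be sketched in a few lines of plain Python. This is a toy illustration only: the vocabulary is a hypothetical eight-entry dictionary that reuses the illustrative IDs from step 3, and the helper names (split_words, to_subwords, encode, decode) are made up for this sketch. Real tokenizers learn vocabularies with tens of thousands of entries and use more refined splitting rules.

```python
import re

# Hypothetical toy vocabulary, reusing the illustrative IDs from step 3.
VOCAB = {
    "Fine": 34389, "Tun": 13932, "##ing": 278, "is": 318,
    "fun": 1257, "for": 329, "all": 477, "!": 0,
}
ID_TO_TOKEN = {token_id: token for token, token_id in VOCAB.items()}


def split_words(text):
    """Word tokenization: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def to_subwords(word):
    """Greedy longest-match split of one word into vocabulary pieces.

    Pieces after the first carry a "##" prefix to mark a continuation.
    """
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                break
            end -= 1
        else:
            # No prefix of the remaining characters is in the vocabulary.
            raise KeyError(f"no vocabulary entry covers {word[start:]!r}")
        start = end
    return pieces


def encode(text):
    """Text -> tokens -> token IDs."""
    tokens = [piece for word in split_words(text) for piece in to_subwords(word)]
    return [VOCAB[token] for token in tokens]


def decode(token_ids):
    """Token IDs -> tokens -> text ("##" pieces are glued to the previous token)."""
    text = ""
    for token_id in token_ids:
        token = ID_TO_TOKEN[token_id]
        if token.startswith("##"):
            text += token[2:]
        else:
            text += (" " if text else "") + token
    return text


ids = encode("Fine Tuning is fun for all!")
print(ids)          # [34389, 13932, 278, 318, 1257, 329, 477, 0]
print(decode(ids))  # Fine Tuning is fun for all !
```

Note that this toy decode leaves a space before the "!" because it joins every whole token with a space; production tokenizers keep enough information to restore the original spacing.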

Before we start, let's clarify some concepts:

  • Tokenization: This is like breaking a sentence into individual words or parts of words. Imagine you have a string of pearls, and each pearl represents a word in a sentence. Tokenization would be the process of identifying each pearl.
  • Tokens: These are the pieces you get after tokenization, much like the individual pearls. In the context of NLP, tokens could be whole words or parts of words.
  • Token IDs: Each token is given a unique number, called a token ID, that a computer can understand. Think of it as assigning a number to each pearl in your string.
  • Truncation: If we have a limit on how many pearls (words) we can keep from our string, truncation is the process of removing pearls to meet this limit. "Left-side truncation" specifically means we start removing pearls from the left (the beginning of the string).
  • Padding: This adds extra tokens to shorter texts to match the length of the longest one.

Python Examples and Explanations

Let's walk through some Python code to see tokenization in action using the Hugging Face transformers library:


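from transformers import AutoTokenizer

# Load the tokenizer that matches the model we plan to use
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-70m")

text = "My name is Rany ElHousieny."

# Tokenize the text and keep only the token IDs
encoded_text = tokenizer(text)["input_ids"]

print(encoded_text)
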
In the snippet above, we load the tokenizer that belongs to the "EleutherAI/pythia-70m" model. The text "My name is Rany ElHousieny." is then tokenized, meaning it's split into tokens and each token is mapped to a unique ID.


To check the fidelity of this process, we decode the tokens back into text, ensuring the original message remains intact.

decoded_text = tokenizer.decode(encoded_text)

print("Decoded tokens back into text: ", decoded_text)        
Decoded tokens back into text:  My name is Rany ElHousieny.        


But what about handling multiple sentences? Or ensuring they fit a specific format? Let's explore:

list_texts = ["Hi, how are you?", "I'm good", "Yes"]

encoded_texts = tokenizer(list_texts)

print("Encoded several texts: ", encoded_texts["input_ids"])
        
Encoded several texts:  [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]
        

Tokenization isn't one-size-fits-all; we often deal with texts of varying lengths. Here, the tokenizer processes multiple sentences at once, assigning each token an ID while keeping each sentence as its own list. Notice that the three lists have different lengths (six, three, and one token).

For a uniform format, especially when batching inputs for training or inference, we use padding and truncation:

Padding:

Padding in the context of machine learning, and more specifically in natural language processing (NLP), is a technique used to ensure that all sequences (e.g., sentences or documents) in a batch are of the same length. Since most machine learning models, including neural networks, require inputs to be of uniform size, padding is crucial when working with text data where the natural length of sentences varies.

Here's a step-by-step explanation of padding:

  1. Identifying the Need for Padding: When you have a batch of sentences of different lengths and you need to process them through a model (like an LLM), you can't feed them into the model as-is because the model expects inputs to be of a consistent size.
  2. Determining the Sequence Length: You decide on a fixed length for your sequences. This could be the length of the longest sentence in your batch or a predefined maximum length that you choose based on your model's requirements or computational limitations.
  3. Applying Padding: For sentences that are shorter than the fixed length, you add a special token, commonly known as a padding token, to the end of the sentence until it reaches the desired length. If the model's maximum sequence length is 10 tokens, and you have a sentence with only 7 tokens, you would add 3 padding tokens to make it 10 tokens long.
  4. Uniform Input Size: After padding, all sentences in your batch have the same number of tokens, which means they can be processed as a single batch by your model. This is essential for parallel processing on GPUs, which greatly speeds up training and inference times.
  5. Special Tokens: The padding token is typically a reserved special token that does not occur in ordinary text. Its ID depends on the tokenizer; in many tokenization systems it is '0'. It's important that the model knows to ignore these padding tokens during processing so that they don't affect the outcome.
  6. Masking: In some models, a mask is used in tandem with padding. The mask indicates to the model which tokens are real and which are padding, allowing the model to only pay attention to the real data.

Here is an illustration of the padding process:

Without padding:

Sentence 1: [The cat sat on the mat] -> [Token IDs: 23, 567, 234, 53, 12, 3456]

Sentence 2: [A dog] -> [Token IDs: 45, 678]
        


With padding (to length 6):

Sentence 1: [23, 567, 234, 53, 12, 3456] (No padding needed)

Sentence 2: [45, 678, 0, 0, 0, 0] (Padded with four '0's to reach length 6)
        

Padding allows for the efficient and effective training of neural networks on variable-length text data, ensuring that each input to the model is treated consistently.
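The illustration above can be reproduced with a small helper. This is a minimal sketch in plain Python, not the library's implementation: pad_batch is a hypothetical name, the pad ID of 0 follows the illustration, and the returned mask plays the role described in the "Masking" step (1 for real tokens, 0 for padding).

```python
def pad_batch(sequences, pad_id=0, target_length=None):
    """Right-pad each sequence to a common length and build matching masks."""
    if target_length is None:
        # Default: pad to the longest sequence in the batch.
        target_length = max(len(seq) for seq in sequences)

    input_ids, attention_mask = [], []
    for seq in sequences:
        if len(seq) > target_length:
            raise ValueError("sequence is longer than target_length; truncate first")
        n_pad = target_length - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Token IDs from the illustration above.
batch = [[23, 567, 234, 53, 12, 3456], [45, 678]]
padded, mask = pad_batch(batch)

print(padded)  # [[23, 567, 234, 53, 12, 3456], [45, 678, 0, 0, 0, 0]]
print(mask)    # [[1, 1, 1, 1, 1, 1], [1, 1, 0, 0, 0, 0]]
```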

Let's continue the example that we started:

list_texts = ["Hi, how are you?", "I'm good", "Yes"]        

Define the Pad Token

tokenizer.pad_token = tokenizer.eos_token        

This line of code configures the tokenizer's padding token to be the same as its end-of-sequence (EOS) token, which has ID 0 in this tokenizer. Let's break it down for clarity:

  1. Tokenizer: As mentioned before, a tokenizer is used in natural language processing (NLP) to break down text into smaller units, or tokens, which are then used by machine learning models for tasks like text generation, classification, etc.
  2. Pad Token (pad_token): In NLP, padding is used to ensure that all sequences (like sentences or paragraphs) are of the same length when feeding them into a model. This is necessary because many machine learning models require input data of a consistent size. The pad token is a special token used to fill in the extra spaces for shorter sequences to match the length of the longest sequence in a batch.
  3. End-of-Sequence Token (eos_token): The EOS token is a special token used to signify the end of a sequence. This is particularly important in models that generate text, as it indicates where the model should consider the sequence (like a sentence or a paragraph) to be complete.
  4. tokenizer.pad_token = tokenizer.eos_token: This line of code sets the tokenizer's pad token to be the same as its EOS token. This means that the same token will be used both to pad shorter sequences and to signify the end of a sequence.

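Why would you do this? The most practical reason is that tokenizers for many GPT-style models ship without a padding token, and this one appears to be among them; in that case, asking the tokenizer to pad fails until a pad token is assigned. Reusing the EOS token is a convenient fix because it needs no new vocabulary entry. More generally, in some models and contexts it can be beneficial to treat the end of a sequence and padding as the same, especially in models where the distinction between padding and the end of a sequence is not crucial or might interfere with how the model processes the text. For example, in some generative models, having distinct padding and EOS tokens might lead to unwanted behavior during text generation, and using the same token for both can mitigate such issues.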

This decision largely depends on the specific model architecture and the nature of the task at hand. It's a configuration choice that can impact how the model interprets and generates text.
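One consequence of reusing the EOS token for padding is worth seeing in miniature: once both share an ID, the ID alone can no longer tell you where the real content stops, so the attention mask has to carry that information. The sketch below uses plain Python lists. EOS_ID = 0 mirrors the tokenizer used in this article, while the example sequence and the helper name pad_to are made up for illustration.

```python
EOS_ID = 0       # EOS ID of the tokenizer used in this article
PAD_ID = EOS_ID  # same effect as tokenizer.pad_token = tokenizer.eos_token


def pad_to(seq, length, pad_id):
    """Right-pad one sequence and return (input_ids, attention_mask)."""
    n_pad = length - len(seq)
    return seq + [pad_id] * n_pad, [1] * len(seq) + [0] * n_pad


# Hypothetical sequence that ends with a genuine EOS token.
sequence = [42, 1353, 1175, EOS_ID]
input_ids, attention_mask = pad_to(sequence, 6, PAD_ID)

print(input_ids)       # [42, 1353, 1175, 0, 0, 0]
print(attention_mask)  # [1, 1, 1, 1, 0, 0]

# Counting by ID treats the genuine EOS as if it were padding...
length_by_id = sum(1 for token_id in input_ids if token_id != PAD_ID)
# ...while the mask still records the true length.
length_by_mask = sum(attention_mask)
print(length_by_id, length_by_mask)  # 3 4
```

This is why code that needs to separate real tokens from padding should rely on the attention mask rather than on comparing token IDs with the pad ID.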

encoded_texts_longest = tokenizer(list_texts, padding=True)

print("Using padding: ", encoded_texts_longest["input_ids"])        
Using padding:  [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175, 0, 0, 0], [4374, 0, 0, 0, 0, 0]]        

This snippet of code involves using a tokenizer to convert a list of text strings into a format suitable for input into a machine learning model, specifically dealing with padding. Here's a breakdown of what's happening:

  1. list_texts: This variable represents a list of text strings. Each item in the list is a separate piece of text (like a sentence or a paragraph) that you want to process with the tokenizer.
  2. tokenizer(list_texts, padding=True): This function call is where the actual tokenization happens. You pass the list of texts to the tokenizer, which processes each text string in the list. The padding=True argument tells the tokenizer to pad the tokenized outputs so that they all have the same length: it adds the padding token to shorter texts until they match the length of the longest text in the list. Padding is crucial when batching together sequences of different lengths, as many machine learning models require inputs to be of uniform size.
  3. encoded_texts_longest: The output of the tokenizer function call is stored in this variable. It's typically a dictionary containing various keys like "input_ids", "attention_mask", etc. "input_ids" are the numerical representations (token IDs) of the texts.
  4. print("Using padding: ", encoded_texts_longest["input_ids"]): This line prints the "input_ids" from the encoded_texts_longest dictionary. Essentially, it's showing the numerical representations of your tokenized texts, after they've been padded to ensure uniform length.

In summary, this code tokenizes a list of text strings, applies padding to make all tokenized outputs the same length, and then prints the numerical representations (input_ids) of these tokenized and padded texts. This process prepares the data for input into machine learning models, particularly those in NLP that require uniform input sizes.

Padding adds extra tokens to shorter texts to match the length of the longest one, while truncation cuts off tokens from longer texts to fit a predetermined size. This standardization is vital for LLMs to process data in batches efficiently.

Using padding: [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175, 0, 0, 0], [4374, 0, 0, 0, 0, 0]]
        

This output represents the token IDs after padding has been applied to ensure that all sequences are of the same length. Here's what's happening:

  • Padding Tokens: The 0 represents the padding token, which is added to sequences to bring them all up to the same length for batch processing. Because we set pad_token to eos_token earlier, 0 here is the ID of this tokenizer's EOS token; other tokenizers use a dedicated padding token (such as [PAD]) with its own ID.
  • Three Sequences: We have three different sequences of token IDs corresponding to three different input texts. The tokenizer has converted each text into a sequence of numerical IDs (token IDs).
  • Longest Sequence: The length of the padded sequences is determined by the longest sequence. In this case, the longest sequence has six tokens ([12764, 13, 849, 403, 368, 32]), so all other sequences are padded with 0 to reach this length.
  • Final Sequences:
     • The first sequence didn't require padding as it already had six tokens.
     • The second sequence ([42, 1353, 1175]) was shorter, so three padding tokens were added to extend its length to six.
     • The third sequence had only one token ID ([4374]), so five padding tokens were added.

Padding is crucial for models that require input data to be of a consistent size, allowing for efficient batch processing and easier handling by neural networks.

Using Truncation:

encoded_texts_truncation = tokenizer(list_texts, max_length=3, truncation=True)

print("Using truncation: ", encoded_texts_truncation["input_ids"])
        
Using truncation: [[12764, 13, 849], [42, 1353, 1175], [4374]]
        

This output shows the token IDs after truncation has been applied. Truncation is used to shorten sequences that are longer than the maximum sequence length allowed by the model or defined by the user. In this case, the maximum length is three tokens, set by max_length=3.

  • Three Sequences: Each sequence represents the tokenized version of an input text, truncated to a maximum length of three tokens.
  • Truncated Sequences:
     • The first sequence originally had six tokens and has been truncated to the first three ([12764, 13, 849]).
     • The second sequence ([42, 1353, 1175]) fits within the three-token limit, so no truncation was needed.
     • The third sequence only had one token to begin with ([4374]), so it remains unchanged.

Truncation is necessary when dealing with models that have a fixed input size or when computational constraints limit the handling of longer sequences. It ensures that each input sequence complies with the required length, although at the risk of losing some information from the removed tokens.

Left Truncation

Sometimes, we might need to truncate from a specific side:

tokenizer.truncation_side = "left"

encoded_texts_truncation_left = tokenizer(list_texts, max_length=3, truncation=True)

print("Using left-side truncation: ", encoded_texts_truncation_left["input_ids"])
        
Using left-side truncation:  [[403, 368, 32], [42, 1353, 1175], [4374]]
        

Left-side truncation ensures that the end of the sentences, often containing crucial information, is preserved by removing tokens from the beginning.

Now, let's interpret the output:

  • The code that produced this output asked the tokenizer to turn each sentence into tokens and then keep at most three of them. Because truncation_side is set to "left", any extra tokens are cut off from the beginning, so the last three tokens are the ones that remain.
  • The sentences were: "Hi, how are you?", "I'm good", and "Yes".
  • Without truncation, the tokenizer turned these sentences into lists of numbers: [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]
  • For "Hi, how are you?", the tokenizer gave us [403, 368, 32]. These numbers represent the last three of the sentence's six tokens; the first three, [12764, 13, 849], were removed from the left.
  • "I'm good" was turned into [42, 1353, 1175]. This sentence didn't need any tokens to be removed because it already had exactly three tokens.
  • "Yes" is a very short sentence with just one token, so the tokenizer gave us one number, [4374]. No truncation was needed here, and no extra numbers were added because we did not ask for padding.

So in summary, this process of tokenization and left-side truncation is about converting sentences to a series of numbers and ensuring that each series has no more than three numbers, even if that means cutting off some of the beginning numbers to meet our requirement.
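The difference between right-side and left-side truncation comes down to which end of the list is sliced off. The following is a minimal sketch in plain Python, applied to the token IDs from the outputs above; truncate is a hypothetical helper name, not the library's API.

```python
def truncate(seq, max_length, side="right"):
    """Keep at most max_length tokens, dropping the excess from the given side."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if len(seq) <= max_length:
        return list(seq)  # already short enough, nothing to remove
    if side == "right":
        return list(seq[:max_length])            # keep the beginning
    return list(seq[len(seq) - max_length:])     # keep the end


# Token IDs of "Hi, how are you?", "I'm good" and "Yes" from the outputs above.
batch = [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]

right = [truncate(seq, 3, side="right") for seq in batch]
left = [truncate(seq, 3, side="left") for seq in batch]

print(right)  # [[12764, 13, 849], [42, 1353, 1175], [4374]]
print(left)   # [[403, 368, 32], [42, 1353, 1175], [4374]]
```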


And when you need both padding and truncation:

encoded_texts_both = tokenizer(list_texts, max_length=3, truncation=True, padding=True)

print("Using both padding and truncation: ", encoded_texts_both["input_ids"])
        
Using both padding and truncation:  [[403, 368, 32], [42, 1353, 1175], [4374, 0, 0]]
        

Here, we're enforcing a strict length requirement, ensuring all text entries fit within a set boundary, ready for LLMs to analyze. In this snippet, the tokenizer cuts off excess tokens (truncation) so that no sequence exceeds max_length=3, and fills in shorter sequences with extra tokens (padding) so that all tokenized outputs end up with the same length. Because truncation_side is still set to "left" from the previous step, the excess tokens are removed from the beginning.

Here's what each part means:

  • Truncation: This reduces the length of the token sequences. If a sentence has more than the maximum specified tokens, it will be cut off to meet the length requirement. The max_length=3 parameter ensures that at most three tokens of each text are kept; with truncation_side set to "left", these are the last three.
  • Padding: This increases the length of the token sequences that are shorter than the longest one by adding a special token. The padding=True parameter pads every sequence to the longest sequence in the batch, which after truncation is three tokens (padding="max_length" would instead pad to max_length regardless of the batch). The pad_token was set to be the eos_token (end-of-sequence token), which is commonly used to signify the end of a text sequence in language models.
  • The Output Array: Each inner array corresponds to the tokenized version of one of the input texts. The tokenizer has converted each text into a sequence of numerical IDs, where each number represents a specific token (word or subword) from the model's vocabulary:
     • [403, 368, 32]: Represents the tokenized version of the first text, "Hi, how are you?". Since the original sentence is longer than three tokens, it has been truncated from the left to fit the specified maximum length.
     • [42, 1353, 1175]: Corresponds to the tokenized form of "I'm good". This sentence is exactly three tokens long, so it needed neither truncation nor padding.
     • [4374, 0, 0]: Stands for the tokenized form of "Yes". This sentence is only one token long, so two padding tokens (indicated by 0, which is the ID of the eos_token that we are using as the padding token) were added to reach the maximum length.

So, in essence, the output shows the token IDs for the given texts, each adjusted to a uniform length by either truncating longer sequences or padding shorter ones, ready for batch processing by a machine-learning model.
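To make the order of operations explicit, here is a rough plain-Python sketch that first truncates and then pads, reproducing the output above. It is an illustration under stated assumptions rather than the library's implementation: truncate_and_pad is a hypothetical name, the pad ID of 0 matches the EOS token used as padding in this article, and padding goes up to the longest sequence that remains after truncation.

```python
def truncate_and_pad(batch, max_length, pad_id=0, truncation_side="left"):
    """Truncate each sequence to max_length, then right-pad to the longest one."""
    truncated = []
    for seq in batch:
        if len(seq) > max_length:
            if truncation_side == "left":
                seq = seq[len(seq) - max_length:]  # drop tokens from the beginning
            else:
                seq = seq[:max_length]             # drop tokens from the end
        truncated.append(list(seq))

    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return input_ids, attention_mask


# Token IDs of the three example sentences before truncation or padding.
batch = [[12764, 13, 849, 403, 368, 32], [42, 1353, 1175], [4374]]
input_ids, attention_mask = truncate_and_pad(batch, max_length=3)

print(input_ids)       # [[403, 368, 32], [42, 1353, 1175], [4374, 0, 0]]
print(attention_mask)  # [[1, 1, 1], [1, 1, 1], [1, 0, 0]]
```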

The Effect of Truncation on Data and Training Quality

Truncation can significantly affect the training and quality of data in a machine learning model, particularly in natural language processing (NLP). Here are some of the impacts of truncation on the data and the subsequent model training:

Impacts on Data Quality:

  1. Loss of Context: Truncation can lead to the loss of important contextual information, especially if it removes parts of the text that contain key data. In left-side truncation, you might lose the beginning of sentences which could include subject information or other critical content.
  2. Distorted Meaning: When sentences are cut short, the remaining text might convey a different meaning than intended, potentially misleading the model during training.
  3. Inconsistent Information: If only a portion of the data is truncated (e.g., longer sentences), this can create inconsistency in the information that the model receives, potentially affecting its ability to learn patterns accurately.

Impacts on Model Training:

  1. Biased Learning: A model might become biased towards the patterns present in the truncated data. If truncation frequently removes a particular type of information, the model may underperform in recognizing or generating such information.
  2. Reduced Model Complexity: Truncation can sometimes be beneficial by simplifying the input data, which may reduce the complexity of the model needed to achieve good performance. For shorter input sequences, simpler models may suffice.
  3. Training Efficiency: Truncation can lead to shorter input sequences, which can make training faster and more memory-efficient. This is particularly important when dealing with very large datasets or when computational resources are limited.

Strategies to Mitigate Negative Effects:

  1. Adequate Sequence Length: Choose a maximum sequence length that captures the essential information of the data while still being computationally feasible.
  2. Strategic Truncation: Apply truncation in a way that preserves important information. For instance, in a question-answering task, ensure that the question and key parts of the answer are not truncated.
  3. Data Augmentation: Use techniques like back-translation or paraphrasing to enrich the dataset and mitigate any biases introduced by truncation.
  4. Attention to Data Distribution: Ensure that the distribution of the truncated data matches the real-world distribution of the task at hand, avoiding over-truncation of certain parts of the data.

In conclusion, while truncation is a practical necessity in many NLP tasks, it's important to apply it thoughtfully to maintain the integrity and utility of the data. By understanding the trade-offs involved and carefully tuning the truncation parameters, one can help ensure that the resulting model is both efficient and effective.
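One way to approach the "Adequate Sequence Length" strategy is to measure how many sequences a candidate limit would actually cut. The sketch below is a simple heuristic, not an established recipe: the function names are hypothetical, and the token counts are invented example data standing in for the lengths of your own tokenized dataset.

```python
import math


def truncation_rate(lengths, max_length):
    """Fraction of sequences longer than max_length, i.e. that would be cut."""
    return sum(1 for n in lengths if n > max_length) / len(lengths)


def length_covering(lengths, keep_percent=95):
    """Smallest max_length that leaves at least keep_percent% of sequences intact."""
    ordered = sorted(lengths)
    index = math.ceil(keep_percent * len(ordered) / 100) - 1
    return ordered[max(index, 0)]


# Invented token counts for ten documents; one outlier is very long.
lengths = [12, 18, 25, 31, 40, 44, 52, 60, 75, 300]

print(truncation_rate(lengths, 64))   # 0.2  (the 75- and 300-token documents)
print(length_covering(lengths, 90))   # 75   (only the 300-token outlier is cut)
print(truncation_rate(lengths, 75))   # 0.1
```

Here a limit of 75 tokens keeps nine of the ten documents whole, whereas accommodating the single outlier would require padding every batch that contains it to 300 tokens.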


Why would we need to use truncation?

Truncation is used in the preprocessing of data for machine learning, particularly in natural language processing (NLP), for several reasons:

  1. Model Input Limitations: Many NLP models, like those based on the Transformer architecture, have a maximum sequence length they can handle. For instance, the original BERT model accepts a maximum length of 512 tokens. If an input sequence exceeds this limit, it must be truncated so the model can process it.
  2. Computational Efficiency: Longer sequences require more computational resources (memory and processing power). By truncating sequences to a manageable length, we can train models more efficiently, both in terms of speed and resource usage.
  3. Batch Processing: When training a model, it's common to process data in batches for faster computation. To create batches, all input sequences typically need to be the same length. Truncation (and padding) standardizes sequence lengths so that they can be batched together.
  4. Focus on Relevant Information: Sometimes, especially with very long documents, not all information is equally relevant to the task at hand. Truncation can help to focus the model's attention on the most relevant parts of the text. For instance, in a sentiment analysis task, the sentiment is often expressed in certain key sentences or phrases within a larger body of text.
  5. Avoiding Noise: In some datasets, especially those scraped from the web, there can be a lot of noisy data towards the ends of documents (such as disclaimers, advertisements, etc.). Truncation can help remove such irrelevant sections of the text.
  6. Uniformity Across Dataset: Truncation enforces uniformity in the length of input sequences, which can help in comparing model performance across different inputs and ensure that the training process treats each input sequence equally.
  7. Resource Constraints: In practical scenarios, especially with limited computational resources, handling shorter sequences via truncation is sometimes the only feasible way to train a model.

While truncation is necessary and useful, it's important to apply it judiciously to avoid losing critical information that could affect model performance. Understanding the context and content of the data can help determine the best strategy for applying truncation.
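The efficiency argument can be made concrete with simple arithmetic on the three example sentences from earlier, whose token counts are 6, 3, and 1. The sketch below counts how many positions a batch occupies once it is padded to its longest sequence, with and without a truncation limit; padded_token_count is a hypothetical helper name.

```python
def padded_token_count(lengths, max_length=None):
    """Positions in a batch after optional truncation and padding to the longest."""
    if max_length is not None:
        lengths = [min(n, max_length) for n in lengths]
    return len(lengths) * max(lengths)


# Token counts of "Hi, how are you?", "I'm good" and "Yes".
lengths = [6, 3, 1]

print(sum(lengths))                    # 10 real tokens
print(padded_token_count(lengths))     # 18 positions when padded to length 6
print(padded_token_count(lengths, 3))  # 9 positions when truncated to 3, then padded
```

Without truncation, 8 of the 18 positions in the batch are padding; with max_length=3 the batch shrinks to 9 positions, at the cost of dropping three real tokens from the longest sentence.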


Conclusion

Tokenization is a nuanced dance of cutting, encoding, and sometimes expanding or shrinking text to suit the needs of LLMs. This preparatory step is not just about breaking down language; it's about rebuilding it into a structure that unlocks the potential of AI in understanding and generating human language. With these Python examples, we've pulled back the curtain to reveal the inner workings of this transformative NLP process.

For those embarking on the journey of NLP and machine learning, this article and its accompanying visualization offer a gateway into the meticulous world of data preparation, where tokenization stands as a sentinel, bridging words to wisdom.

More articles by Rany ElHousieny, PhDᴬᴮᴰ
