Large Language Models
1. Introduction
Large Language Models (LLMs) are advanced artificial intelligence systems engineered to understand and produce human language through text generation, summarization, and more. They broadly fall into two categories:
To put it simply, LLMs learn from vast amounts of text data by identifying patterns and relationships between words that co-occur within sentences.
LLMs utilize sophisticated structures called Transformers, which integrate deep learning and natural language processing techniques.
Prominent examples of LLMs include GPT, BERT, LLaMA, and T5, which are trained on extensive datasets such as Wikipedia, millions of books, large crawls of the web, and many other sources, as we will see later. Training such models demands significant computational resources (primarily GPUs), long training times, and substantial budgets, usually available only to large organizations.
Most users engage with pre-trained (Foundation) models and may apply fine-tuning strategies using specific knowledge bases. For instance, in the insurance sector, fine-tuning might involve incorporating insurance policy terms and conditions or anonymized customer interactions. Also, Question & Answer problems on private knowledge bases can be tackled with Retrieval-Augmented Generation (RAG), which will be discussed later.
Transformers, the underlying architecture of LLMs, can be categorized into three types based on their functions:
We will look at the key components and the different learning strategies of these transformers, and finally we will see how to measure and compare the quality of different LLMs.
We will start by analyzing each LLM component, spending more time on what I think is most relevant to understanding why this architecture has revolutionized the world of Artificial Intelligence, and then we will put all the pieces together.
2. LLM Components
We will see three types of components/blocks:
2.1 Tokenizer
A Tokenizer transforms a sentence into a sequence of tokens; for the sake of simplicity, we can assume that each word of a given sentence corresponds to one token (even though some tokenizers use sub-word tokens). So from now on I will use "token" and "word" interchangeably.
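To make the "one word, one token" simplification concrete, here is a minimal sketch. The function name tokenize and the example sentence are my own assumptions for illustration; real tokenizers are more sophisticated and often split words into sub-word pieces.

```python
def tokenize(sentence: str) -> list[str]:
    """Naive tokenizer: one token per whitespace-separated word.

    This is only the simplification used in this article, not how
    production tokenizers (which often use sub-word units) work.
    """
    return sentence.split()


# Hypothetical example sentence, chosen only for illustration.
print(tokenize("Bono is a singer"))  # ['Bono', 'is', 'a', 'singer']
```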
2.2 Embedding
Since software can only work with numbers, embedding is the process that maps each token to numbers: a vector, that is, a sequence of numbers, is assigned to each token.
The beauty of the embedding is that the numerical representations of the words capture semantic similarity, and to some extent, syntactic relationships between words; since words can have many different meanings, each number somehow represents the meaning of the word in a different context.
To give an idea of the size of a real embedding vector, one of the most popular embedding models uses 1536 numbers to represent a single token; the number of words in an embedding model (its vocabulary) is usually in the tens of thousands (n*10,000).
We can imagine that each word is positioned in a multi-dimensional space (let's think about a 2D space for simplicity), not randomly but in such a way that the distance between pairs of words represents their similarity (in terms of semantic meaning); we will see very soon the typical tool used to measure the distance between words in this space.
To grasp the intuition behind an embedding model, in the rest of this paragraph, I will show you how a naive word embedding can be built.
The idea is to train a deep network (see Figure 1) using the assumption that two words having a similar meaning will be surrounded by the same words (context) in most of the sentences of our training dataset; so, if we train the network to produce a similar output for these two words, the hidden weights of the neural network will be similar and can be used as word embedding.
Yes, you understood correctly: we are training a neural network not to use its output, as we learned in the first series "AI in Simple Terms for Non-Techy People", but to use its internal weights!
Let's see a very small example to better clarify this concept; we are going to use a very small vocabulary made of only 9 words and only 4 short sentences as a training dataset; these are the sentences:
We will use a model dimension equal to two, meaning that each word will be represented by a pair like [1,4] or [-2, 3] and positioned on a Cartesian plane.
In our naive example, we are going to consider a context window (words preceding and following a given word) equal to one; under this assumption, the possible pairs of words are:
These pairs are our training dataset: the deep network will learn the probability distribution of the output words (second word of each pair) given each input word (first word of each pair).
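As a rough sketch of how such pairs can be generated, the function below collects every (input word, context word) pair for a given window size. The function name context_pairs and the example sentence are hypothetical and are not taken from the dataset of this article.

```python
def context_pairs(tokens: list[str], window: int = 1) -> list[tuple[str, str]]:
    """Return (input word, context word) pairs within the given window."""
    pairs = []
    for i, center in enumerate(tokens):
        # Positions of the neighbours, clipped to the sentence boundaries.
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical sentence, used only to show the mechanics.
for pair in context_pairs(["Bono", "is", "a", "singer"], window=1):
    print(pair)
```

With a window of one, each word is paired only with its immediate neighbours; a larger window produces more pairs per sentence.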
For example, the network will learn that, given as input the words "U2" or "Beatles", the most likely output should be the same: "singer"; and because the network output will be the same whether "U2" or "Beatles" is provided as input, the network's weights for these two words should be similar and can be used as embedding vectors.
Let's build this deep network.
Initially, we associate each word of the vocabulary with a progressive number (from 1 to 9, since the vocabulary contains 9 words) and then convert these numbers into vectors of length 9, with eight 0s and a single 1 in the position corresponding to the number; for example, if the word "is" is associated with the number 1 and "a" is associated with the number 2, the two vectors will be respectively [1,0,0,0,0,0,0,0,0] and [0,1,0,0,0,0,0,0,0].
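This encoding (usually called one-hot encoding) can be sketched in a few lines. The function name one_hot is my own choice; the numbering starts from 1 to match the description above.

```python
def one_hot(number: int, vocabulary_size: int) -> list[int]:
    """Return a vector of zeros with a single 1 at the given position.

    Positions are counted from 1, following the numbering used in the text.
    """
    if not 1 <= number <= vocabulary_size:
        raise ValueError("number must be between 1 and vocabulary_size")
    vector = [0] * vocabulary_size
    vector[number - 1] = 1
    return vector


print(one_hot(1, 9))  # [1, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot(2, 9))  # [0, 1, 0, 0, 0, 0, 0, 0, 0]
```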
These vectors are used as input to train a deep network with:
There is no activation function in this network.
Figure 1
Given a word, only one input is activated (because its vector contains eight zeros and a single one) and 9 output values are generated; initially, the outputs will be random numbers that depend on the initial weights but, with backpropagation and training, these 9 output values will come to represent the probability of each of the 9 words appearing next to the word given as input.
To have a real probability distribution, each number has to assume a value between zero and one, and the sum of the 9 numbers should be 1.
There is a special function used to convert a series of numbers into a probability distribution; this function is called Softmax.
You can find more details here: "Introduction to Softmax for Neural Network".
This formula can be scary at the beginning but it is quite simple; let's assume that we have a sequence of 5 numbers like this:
The softmax will calculate 5 exponential values and their sum (which is 75.1) that represents the denominator of the Softmax formula.
Finally, the softmax values are obtained by dividing each of these values by 75.1, obtaining numbers between zero and one whose sum is 1, hence representing a probability distribution.
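The same steps can be sketched in code: compute the exponential of each number, sum them, and divide each exponential by that sum, that is, softmax(x_i) = exp(x_i) / (exp(x_1) + ... + exp(x_n)). The five input values below are hypothetical and are not the ones used in the example above.

```python
import math


def softmax(values: list[float]) -> list[float]:
    """Convert a list of numbers into a probability distribution."""
    exponentials = [math.exp(v) for v in values]
    denominator = sum(exponentials)  # the sum of all the exponentials
    return [e / denominator for e in exponentials]


# Hypothetical input values, chosen only for illustration.
scores = [1.0, 2.0, 3.0, 0.5, 0.0]
probabilities = softmax(scores)
print(probabilities)
print(sum(probabilities))  # very close to 1
```

Note that the largest input always receives the largest probability, which is why the most likely output word can be read directly from the Softmax result.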
Going back to our deep network and considering what is described in "Different types of AI", we are solving a multi-class classification problem with 9 possible classes.
Going back to our small dataset of 4 sentences and 9 words: after the training process, the input (say, 5) corresponding to the token "Beatles" will generate as output a probability distribution whose largest value corresponds to the number (say, 7) associated with the word "singer".
As we said, at the end of the training process, the two weights connecting each input cell to the two hidden nodes can be used as the embedding vector of the corresponding word, as in Table 2. So the weights from input 1 to the two hidden nodes will be the embedding vector for the token associated with the number 1, the weights from input 2 to the two hidden nodes will be the embedding vector for the token associated with the number 2, and so on (see Table 2, where the 9 numbers in the first column start from zero and end at eight).
Table 2
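To show how such weights can emerge from training, here is a minimal pure-Python sketch of a network like the one described: one-hot input, a small hidden layer with no activation function, and a Softmax output trained with gradient descent. Everything specific in it is an assumption made for illustration: the function names, the learning rate, the number of epochs, the random seed and, above all, the toy pairs, which are not the dataset of this article. Words are identified by numbers starting from zero, as in Table 2.

```python
import math
import random


def softmax(values):
    """Numerically stable Softmax."""
    largest = max(values)
    exps = [math.exp(v - largest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(w_in, w_out, x):
    """Return the output probabilities for input word number x."""
    hidden = w_in[x]  # a one-hot input simply selects row x of w_in
    size = len(w_out[0])
    scores = [sum(h * w_out[k][j] for k, h in enumerate(hidden)) for j in range(size)]
    return softmax(scores)


def train_embeddings(pairs, vocab_size, dim=2, epochs=1000, lr=0.1, seed=0):
    """Train the network on (input, output) pairs of word numbers.

    Returns w_in, whose row i is the embedding of word i, and w_out.
    """
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(vocab_size)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(vocab_size)] for _ in range(dim)]
    for _ in range(epochs):
        for x, y in pairs:
            hidden = w_in[x][:]  # copy, so updates below do not alter it
            probs = forward(w_in, w_out, x)
            # Cross-entropy gradient on the scores: probabilities minus one-hot target.
            grad = probs[:]
            grad[y] -= 1.0
            # Gradient on the hidden values, computed before w_out is updated.
            grad_hidden = [
                sum(w_out[k][j] * grad[j] for j in range(vocab_size)) for k in range(dim)
            ]
            for k in range(dim):
                for j in range(vocab_size):
                    w_out[k][j] -= lr * hidden[k] * grad[j]
                w_in[x][k] -= lr * grad_hidden[k]
    return w_in, w_out


def predict(w_in, w_out, x):
    """Return the number of the most likely output word for input x."""
    probs = forward(w_in, w_out, x)
    return probs.index(max(probs))


# Hypothetical toy data: words 0 and 1 are both followed by word 2,
# while word 3 is followed by word 4.
toy_pairs = [(0, 2), (1, 2), (3, 4)]
w_in, w_out = train_embeddings(toy_pairs, vocab_size=5)
for number, embedding in enumerate(w_in):
    print(number, embedding)  # one 2-dimensional vector per word
```

In this sketch the rows of w_in play the role of the rows of Table 2: each one is the pair of weights from one input to the two hidden nodes.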
We can represent the embedding graphically on a plane, as in Figure 2, where for simplicity only a few relevant words are shown; in this graph, we can see that our naive embedding puts the pairs (U2, Beatles) and (Lennon, Bono) in very similar positions even though, considering the tiny size of the dataset, the embedding is far from perfect.
Figure 2
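This example was a simplified version of the Word2Vec family of approaches; predicting a context word from an input word, as we did here, is closest to the variant known as Skip-gram, while the related Continuous Bag of Words (CBOW) variant predicts a word from its surrounding context.
For an in-depth analysis you can read "An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec"; a nice video on the topic is "Word Embedding and Word2Vec, Clearly Explained!!!"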
Embeddings can be trained with different approaches and larger window sizes, for example using positive and negative examples and a binary classifier; also, optimization techniques are needed to reduce the computational cost of the neural network, because the real numbers for an embedding are: an input dimension equal to n*10,000, a hidden size equal to 1536, and a number of weights equal to n*10,000*1536*2, so tens of millions of weights or more to be updated by the gradient descent algorithm.
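To get a feeling for these numbers, the short calculation below applies the formula above. The vocabulary size of 50,000 (that is, n = 5) is a hypothetical value chosen only for illustration.

```python
def embedding_weight_count(vocabulary_size: int, hidden_size: int) -> int:
    """Weights of the naive network: input-to-hidden plus hidden-to-output."""
    return vocabulary_size * hidden_size * 2


# Hypothetical vocabulary of 50,000 words (n = 5) and a hidden size of 1536.
print(embedding_weight_count(50_000, 1536))  # 153600000
```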
Once we have an embedding, there are many different metrics for computing the similarity between two words, Cosine Similarity being one of the most popular. Cosine similarity uses linear algebra, a branch of mathematics, to evaluate the similarity between two vectors by measuring the cosine of the angle between them (see Figure 3).
Figure 3 source: Cosine similarity
The cosine similarity between two vectors A and B of dimension two can be calculated as follows:
A = [a1, a2], B = [b1, b2]
Cosine similarity(A, B) = (a1*b1 + a2*b2) / (|A| * |B|)
where |A| is the length of the vector A, that is |A| = sqrt(a1*a1 + a2*a2), and |B| = sqrt(b1*b1 + b2*b2).
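The formula translates directly into code. The sketch below works for vectors of any dimension, not only two; the function name cosine_similarity is my own choice, and the two example vectors are the illustrative pairs mentioned earlier, not real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between vectors a and b (same length, non-zero)."""
    dot_product = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (length_a * length_b)


# (1*-2 + 4*3) / (sqrt(17) * sqrt(13)) = 10 / sqrt(221)
print(cosine_similarity([1, 4], [-2, 3]))  # about 0.67
```

The result is 1 for vectors pointing in the same direction, 0 for perpendicular vectors, and -1 for opposite vectors, so a higher value means the two words are more similar.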