Large Language Models

1. Introduction

Large Language Models (LLMs) are advanced artificial intelligence systems engineered to understand and produce human language, performing tasks such as text generation and summarization. These tasks broadly fall into two categories:

  • Language Generation, which encompasses tasks such as generating text, writing software code, and translating.
  • Language Understanding, which involves tasks such as text classification and sentiment analysis.

To put it simply, LLMs learn from vast amounts of text data by identifying patterns and relationships between words that co-occur within sentences.

LLMs utilize sophisticated structures called Transformers, which integrate deep learning and natural language processing techniques.

Prominent examples of LLMs include GPT, BERT, LLaMA, and T5, which are trained on extensive datasets, including Wikipedia, millions of books, large portions of the web, and many other datasets, as we will see later. Training such models demands significant computational resources (primarily GPUs), long training times, and substantial financial resources, usually available only to large organizations.

Most users work with pre-trained (Foundation) models and may fine-tune them on specific knowledge bases. For instance, in the insurance sector, fine-tuning might involve incorporating insurance policy terms and conditions or anonymized customer interactions. Also, Question & Answer problems on private knowledge bases can be tackled with Retrieval-Augmented Generation (RAG), which will be discussed later.

Transformers, the underlying architecture of LLMs, can be categorized into three types based on their functions:

  • Encoder-only Transformers, which are designed for understanding language.
  • Decoder-only Transformers, which excel at generating text.
  • Encoder-Decoder Transformers, which are particularly effective for translation tasks, as they can both understand and generate language.

Table 1

We will look at the key components and the different learning strategies of these transformers and, finally, at how to measure and compare the quality of different LLMs.

We will start by analyzing each LLM component, spending more time on what I think is most relevant to understanding why this architecture has revolutionized the world of Artificial Intelligence, and finally we will put all the pieces together.

2. LLM Components

We will see three types of components/blocks:

  • Preprocessing: tokenizer, embedding, positional encoding; these are used to transform a sentence into something digestible by a transformer.
  • Transformer blocks: attention, feed-forward, Softmax, and others; these are the core blocks of a transformer, used to grasp the complexity of human language.
  • The transformer head: this is the block specialized in generating the transformer output.

2.1 Tokenizer

A Tokenizer transforms a sentence into a sequence of tokens; for the sake of simplicity, we can assume that each word in a given sentence corresponds to one token (even though many tokenizers use sub-word tokens). So from now on I will use "token" and "word" interchangeably.
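As a minimal sketch of the simplification above (one word, one token), a whitespace tokenizer takes only a few lines of Python. The function name `tokenize` is illustrative, and real tokenizers typically split words into sub-word units instead.

```python
def tokenize(sentence):
    """Split a sentence into word-level tokens (a simplification of real tokenizers)."""
    return sentence.split()

tokens = tokenize("U2 is a rock band")
print(tokens)  # ['U2', 'is', 'a', 'rock', 'band']
```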

2.2 Embedding

Since software can only manage numbers, embedding is the process that maps each token to numbers: a vector, that is, a sequence of numbers, is assigned to each token.

The beauty of the embedding is that the numerical representations of the words capture semantic similarity, and to some extent, syntactic relationships between words; since words can have many different meanings, each number somehow represents the meaning of the word in a different context.

To give an idea of the dimension of a real embedding vector, one of the most popular embedding models uses 1536 numbers to represent a single token; the number of words in an embedding model (its vocabulary) is usually in the tens of thousands (n*10,000).

For simplicity, we can imagine that each word is positioned in a multi-dimensional space (let's think of a 2D space) not randomly, but so that the distance between pairs of words represents their similarity (in terms of semantic meaning); we will see very soon the typical tool used to measure the distance between words in this space.

To grasp the intuition behind an embedding model, in the rest of this paragraph, I will show you how a naive word embedding can be built.

The idea is to train a deep network (see Figure 1) using the assumption that two words having a similar meaning will be surrounded by the same words (context) in most of the sentences of our training dataset; so, if we train the network to produce a similar output for these two words, the hidden weights of the neural network will be similar and can be used as word embedding.

Yes, you understood correctly: we are training a neural network not to use the output as we learned in the first series AI in Simple Terms for Non-Techy People but to use its internal weights!

Let's see a very small example to better clarify this concept; we are going to use a very small vocabulary made of only 9 words and only 4 short sentences as a training dataset; these are the sentences:

  • U2 is a rock band
  • Beatles is a rock band
  • U2 singer is Bono
  • Beatles singer is Lennon

We will use a model dimension equal to two, meaning that each word will be represented by a pair like [1,4] or [-2, 3] and positioned on a Cartesian plane.

In our naive example, we are going to consider a context window (words preceding and following a given word) equal to one; under this assumption, the possible pairs of words are:

These pairs are our training dataset: the deep network will learn the probability distribution of the output words (second word of each pair) given each input word (first word of each pair). 
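The pairs can be enumerated mechanically from the four sentences. The sketch below is one way to do it, assuming that each pair is listed once even if it occurs in more than one sentence; the function name `context_pairs` is illustrative.

```python
sentences = [
    "U2 is a rock band",
    "Beatles is a rock band",
    "U2 singer is Bono",
    "Beatles singer is Lennon",
]

def context_pairs(sentences, window=1):
    """Return the unique (input word, output word) pairs within the context window."""
    pairs = []
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            start = max(0, i - window)
            end = min(len(words), i + window + 1)
            for j in range(start, end):
                pair = (word, words[j])
                # Skip the word itself and pairs that were already collected.
                if j != i and pair not in pairs:
                    pairs.append(pair)
    return pairs

pairs = context_pairs(sentences)
for pair in pairs:
    print(pair)
```

With a window of one, both ("U2", "singer") and ("Beatles", "singer") appear in the list, which is exactly the kind of shared context the network is expected to exploit.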

For example, the network will learn that, given the word "U2" or "Beatles" as input, the most likely outputs should be the same (for instance "singer"). Because the network output will be the same whether "U2" or "Beatles" is provided as input, the network's weights for these two words should be similar and can be used as embedding vectors.

Let's build this deep network.

Initially, we associate each word of the vocabulary with a progressive number (from 1 to 9, since the vocabulary contains 9 words) and then convert these numbers into vectors of length 9, with eight 0s and a single 1 in the position corresponding to the number (a so-called one-hot encoding); for example, if the word "is" is associated with the number 1 and "a" is associated with the number 2, the two vectors will be respectively [1,0,0,0,0,0,0,0,0] and [0,1,0,0,0,0,0,0,0].
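The conversion just described can be sketched as follows. The positions of "is" and "a" follow the example in the text; the order of the remaining words is an arbitrary assumption, and the function name `one_hot` is illustrative.

```python
# Word order: "is" and "a" come first as in the text; the rest is an assumed order.
vocabulary = ["is", "a", "U2", "rock", "Beatles", "band", "singer", "Bono", "Lennon"]

def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's position."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

print(one_hot("is", vocabulary))  # [1, 0, 0, 0, 0, 0, 0, 0, 0]
print(one_hot("a", vocabulary))   # [0, 1, 0, 0, 0, 0, 0, 0, 0]
```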

These vectors are used as input to train a deep network with:

  • 9 inputs, equal to the vocabulary size
  • 1 hidden layer with 2 nodes (two is the model dimension we have defined)
  • 9 outputs, again because 9 is the vocabulary size

There is no activation function in this network.

Figure 1

Given a word, only one input is activated (because each input vector has eight zeros and a single one) and 9 output values are generated; initially, the outputs are essentially random numbers that depend on the initial weights but, after training with backpropagation, these 9 output values will represent the probability distribution over the 9 words given a specific word as input.
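To make the flow of numbers concrete, here is a minimal sketch of an untrained forward pass through such a network, with 9 inputs, 2 hidden nodes, 9 outputs and no activation function. The names `w_in`, `w_out` and `forward`, the random seed and the initialization range are all assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible
VOCAB_SIZE, MODEL_DIM = 9, 2

# w_in[i] holds the two weights from input i to the two hidden nodes.
w_in = [[random.uniform(-1, 1) for _ in range(MODEL_DIM)] for _ in range(VOCAB_SIZE)]
# w_out[h] holds the nine weights from hidden node h to the nine outputs.
w_out = [[random.uniform(-1, 1) for _ in range(VOCAB_SIZE)] for _ in range(MODEL_DIM)]

def forward(one_hot_input):
    """Linear forward pass: 9 inputs -> 2 hidden values -> 9 raw output values."""
    hidden = [
        sum(x * w_in[i][h] for i, x in enumerate(one_hot_input))
        for h in range(MODEL_DIM)
    ]
    return [
        sum(hidden[h] * w_out[h][o] for h in range(MODEL_DIM))
        for o in range(VOCAB_SIZE)
    ]

scores = forward([1, 0, 0, 0, 0, 0, 0, 0, 0])
print(len(scores))  # 9
```

Because the input contains a single 1, the two hidden values are simply the two weights attached to the active input.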

To have a real probability distribution, each number has to assume a value between zero and one, and the sum of the 9 numbers should be 1. 

There is a special function used to convert a series of numbers into a probability distribution; this function is called Softmax:

Softmax(x_i) = exp(x_i) / (exp(x_1) + exp(x_2) + ... + exp(x_n))

You can find more details in Introduction to Softmax for Neural Network.

This formula can be scary at the beginning but it is quite simple; let's assume that we have a sequence of 5 numbers like this:

Softmax calculates the 5 exponential values and their sum (which is 75.1), which is the denominator of the Softmax formula.

Finally, the softmax values are obtained by dividing each of these values by 75.1, obtaining numbers between zero and one whose sum is 1, hence representing a probability distribution.
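The same steps (exponentials, their sum, then the division) can be sketched in Python. The five input values below are illustrative assumptions and are not the numbers used in the example above, so their sum of exponentials differs from 75.1.

```python
import math

def softmax(values):
    """Convert a list of numbers into a probability distribution."""
    exps = [math.exp(v) for v in values]   # one exponential per input value
    total = sum(exps)                      # denominator of the Softmax formula
    return [e / total for e in exps]       # each result lies between 0 and 1

values = [1.0, 2.0, 0.5, 3.0, 1.5]  # illustrative values, not the article's
probs = softmax(values)
print([round(p, 3) for p in probs])
print(sum(probs))  # 1.0, up to floating-point rounding
```

The largest input always receives the largest probability, and the order of the values is preserved.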

Going back to our deep network and considering what is described in "Different types of AI", we are solving a multi-class classification problem with 9 possible classes.

Going back to our small dataset of 4 sentences and 9 words, after the training process, the input 5 (say), corresponding to the token "Beatles", will generate as output a probability distribution whose largest value is at position 7 (say), which corresponds to the word "singer".

As we said, at the end of the training process, the two weights connecting each input cell to the two hidden nodes can be used as the embedding vector of the corresponding word, as in Table 2. So the weights from input 1 to the two hidden nodes will be the embedding vector for the token associated with the number 1, the weights from input 2 to the two hidden nodes will be the embedding vector for the token associated with the number 2, and so on (see Table 2, where the 9 numbers in the first column start from zero and end at eight).

Table 2

We can represent the embedding graphically on a plane as in Figure 2 where, for simplicity, only a few relevant words are shown; in this graph, we can notice that our naive embedding puts the pairs (U2, Beatles) and (Lennon, Bono) in very similar positions even if, given the tiny size of the dataset, the embedding is far from perfect.

Figure 2
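For readers who want to reproduce the experiment, the whole procedure can be sketched in plain Python: build the word pairs, train the 9-2-9 network with Softmax and gradient descent, then read the input weights as embeddings. This is only a sketch under assumptions that are not in the text: the learning rate, the number of epochs, the random seed, alphabetical word numbering, and repeated pairs being kept. The resulting vectors will therefore differ from those in Table 2, and how close "U2" and "Beatles" end up depends on these settings.

```python
import math
import random

sentences = ["U2 is a rock band", "Beatles is a rock band",
             "U2 singer is Bono", "Beatles singer is Lennon"]
vocab = sorted({w for s in sentences for w in s.split()})
index = {w: i for i, w in enumerate(vocab)}

# (input, output) index pairs with a context window of one.
pairs = []
for s in sentences:
    words = s.split()
    for i, w in enumerate(words):
        for j in (i - 1, i + 1):
            if 0 <= j < len(words):
                pairs.append((index[w], index[words[j]]))

random.seed(0)  # assumed seed, for reproducibility only
V, D = len(vocab), 2
w_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(D)]

def softmax(scores):
    m = max(scores)  # subtracting the maximum avoids overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict(i):
    """Probability distribution over output words for input word i."""
    hidden = w_in[i]  # a one-hot input simply selects row i of w_in
    return softmax([sum(hidden[h] * w_out[h][o] for h in range(D))
                    for o in range(V)])

def average_loss():
    """Average cross-entropy over all training pairs."""
    return -sum(math.log(predict(i)[o]) for i, o in pairs) / len(pairs)

initial_loss = average_loss()
learning_rate = 0.1  # assumed value
for epoch in range(500):  # assumed number of epochs
    for i, o in pairs:
        probs = predict(i)
        # Gradient of the cross-entropy with respect to the raw outputs.
        grad = [p - (1.0 if k == o else 0.0) for k, p in enumerate(probs)]
        hidden = w_in[i][:]
        for h in range(D):
            grad_hidden = sum(grad[k] * w_out[h][k] for k in range(V))
            for k in range(V):
                w_out[h][k] -= learning_rate * grad[k] * hidden[h]
            w_in[i][h] -= learning_rate * grad_hidden
final_loss = average_loss()

print(f"loss: {initial_loss:.3f} -> {final_loss:.3f}")
for w in vocab:
    print(w, [round(x, 2) for x in w_in[index[w]]])  # the word's embedding
```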

This example is a simplified version of an approach known as Continuous Bag of Words (CBOW).

For an in-depth analysis you can read An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec; also, a nice video on the topic is Word Embedding and Word2Vec, Clearly Explained!!!

Embeddings can be trained with different approaches and larger window sizes, for example using positive and negative examples and a binary classifier; also, optimization techniques are needed to reduce the computational complexity of the neural network, because the real numbers for an embedding are: an input dimension equal to n*10,000, a hidden size equal to 1536, and a huge number of weights equal to n*10,000*1536*2, so many millions of weights to be updated by the gradient descent algorithm.
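To get a feeling for the order of magnitude, the weight count can be computed for one assumed vocabulary size. The value n = 5 (a vocabulary of 50,000 words) is purely illustrative.

```python
vocabulary_size = 5 * 10_000   # n * 10,000 with an assumed n = 5
embedding_size = 1536          # hidden size mentioned in the text
# Input-to-hidden weights plus hidden-to-output weights.
weights = vocabulary_size * embedding_size * 2
print(f"{weights:,}")  # 153,600,000
```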


Once we have an embedding, there are many different metrics for computing the similarity between two words, Cosine Similarity being one of the most popular. Cosine similarity uses linear algebra, a branch of mathematics, to evaluate the similarity between two vectors as the cosine of the angle between them (see Figure 3).

Figure 3 source: Cosine similarity

Cosine similarity between two vectors A and B of dimension two can be calculated as follows:

A = [a1, a2], B = [b1, b2]

Cosine similarity(A, B) = (a1*b1 + a2*b2) / (|A| * |B|)

where |A| is the length of vector A, that is sqrt(a1*a1 + a2*a2), and |B| = sqrt(b1*b1 + b2*b2).
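The formula translates directly into a few lines of Python. The sketch below works for vectors of any dimension, not only two; the function name `cosine_similarity` is illustrative, and the example vectors reuse the pairs [1,4] and [-2,3] mentioned earlier.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

print(cosine_similarity([1, 4], [2, 8]))    # same direction: 1.0, up to rounding
print(cosine_similarity([1, 0], [0, 1]))    # perpendicular: 0.0
print(cosine_similarity([1, 4], [-2, 3]))   # somewhere in between
```

Values close to 1 indicate vectors pointing in nearly the same direction (similar words), while values close to 0 indicate unrelated directions.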
