Why do LLMs Hallucinate?

Because response diversity cannot exist without it

Large language models (LLMs) generate text by predicting the most likely sequence of words based on a given prompt. While LLMs are powerful, they can sometimes produce responses that are factually incorrect or unrelated to the prompt. These mistakes are commonly known as ‘hallucinations’.

Hallucinations happen when the model creates information that sounds plausible but isn’t grounded in real-world facts. They derive from the same mechanisms that allow LLMs to generate creative and flexible responses. This article will explore why hallucinations occur, the trade-off between hallucinations and creativity, and how we can reduce hallucinations without stifling the diversity of a model’s outputs.

How do LLMs Create a Response?

Let’s examine a high-level view of what happens (refer to Figure 1). This approach is commonly called Top-p sampling, and it is widely used to create LLM responses.


Figure 1: High-Level Process for creating an LLM Response


Step 1: Submit a Prompt

The first step is straightforward: we submit a prompt to our LLM of choice.
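
If you’d like to follow along in code, here is a minimal sketch of Step 1 using the Hugging Face transformers library, with the small gpt2 model standing in for ‘our LLM of choice’. The model, prompt, and parameter values are purely illustrative; the top_p setting is the threshold we will meet in Step 4.

```python
# A minimal sketch of Step 1, requiring the "transformers" and "torch" packages.
# The model name, prompt, and parameter values are illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The capital of England is"
result = generator(
    prompt,
    max_new_tokens=20,  # how much text to generate beyond the prompt
    do_sample=True,     # sample from the distribution rather than always taking the top word
    top_p=0.9,          # the Top-p threshold we will meet in Step 4
)
print(result[0]["generated_text"])
```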

Step 2: Assign Probabilities to Words

Based on previous training, the model assigns probabilities to words in its vocabulary. I use ‘words’ here, although a vocabulary can include other items such as partial words and punctuation. A vocabulary can contain 50,000 items or more. As a comparison, a typical native English speaker will have learned between 15,000 and 35,000 words by the time they reach adulthood.


Figure 2: Assigning probabilities to words in an LLM’s vocabulary


Figure 2 shows bars representing words and their assigned probabilities. The chart shows only the first nine items of a model’s vocabulary.
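
To make ‘assigning probabilities’ a little more concrete, the sketch below turns a set of made-up raw scores (logits) into a probability distribution using the softmax function. The nine-word vocabulary and the scores are invented for illustration; a real model scores every item in a vocabulary of 50,000 or more.

```python
import numpy as np

# Hypothetical raw scores ("logits") for a tiny nine-word vocabulary;
# a real model produces one score per item in a 50,000+ item vocabulary.
vocab  = ["time", "day", "dream", "night", "story", "king", "cat", "moon", "the"]
logits = np.array([3.1, 1.2, 0.8, 0.7, 0.5, -0.2, -0.5, -1.0, -1.3])

# The softmax function turns raw scores into probabilities that sum to one.
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()

for word, p in zip(vocab, probs):
    print(f"{word:>6}: {p:.3f}")
```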

Step 3: Sort the Vocabulary by Probability

The vocabulary is sorted by probability, with the most likely words at the beginning (Figure 3).


Figure 3: The LLM vocabulary after sorting by probability


Step 4: Determine the Shortlist Threshold

Now that we have a sorted vocabulary, it’s time to create a shortlist. To do this, we use a threshold: a value between zero and one. A threshold of 0.9 is typical, although we can change it if necessary. We use this threshold to create a shortlist of candidate words.

Step 5: Create a Shortlist

A shortlist contains just enough words for their cumulative probability to be greater than or equal to the threshold. Beginning on the left-hand side of our sorted vocabulary, we continue adding words until this condition is satisfied. If the first probability, P1, equals or exceeds the threshold, then our shortlist will contain only one item.


Figure 4: Creating a shortlist of candidate words

Figure 4 shows an example where five words are required to meet the threshold, such that:

P1 + P2 + P3 + P4 < threshold ≤ P1 + P2 + P3 + P4 + P5

In other words:

  • The sum of the first four probabilities is less than the threshold value.
  • The sum of the first five probabilities is greater than or equal to the threshold value.

So, we now have a shortlist consisting of the first five words in the sorted vocabulary list.
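
Steps 3 to 5 can be expressed in a few lines of Python. The helper below (top_p_shortlist is just a name chosen for this sketch) sorts a made-up vocabulary by probability and keeps the smallest prefix whose cumulative probability reaches the threshold. It is a simplified illustration, not a production sampler.

```python
import numpy as np

def top_p_shortlist(vocab, probs, threshold=0.9):
    """Sort words by probability and keep the smallest prefix whose
    cumulative probability reaches the threshold (Steps 3 to 5)."""
    order = np.argsort(probs)[::-1]            # Step 3: most likely words first
    sorted_vocab = [vocab[i] for i in order]
    sorted_probs = np.asarray(probs)[order]
    cumulative = np.cumsum(sorted_probs)       # running total of probabilities
    cutoff = int(np.searchsorted(cumulative, threshold)) + 1  # Step 5: first point where the total reaches the threshold
    return sorted_vocab[:cutoff], sorted_probs[:cutoff]

# Made-up probabilities for a tiny five-word vocabulary.
vocab = ["time", "day", "dream", "night", "story"]
probs = [0.05, 0.40, 0.10, 0.30, 0.15]

shortlist, shortlist_probs = top_p_shortlist(vocab, probs, threshold=0.9)
print(shortlist)  # ['day', 'night', 'story', 'dream']
```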

Step 6: Randomly Choose the Next Word

In this step, the next word is chosen at random from the shortlist, and the process loops back to Step 2 until the response is complete. A model has several ways to determine when to stop; we won’t go into the details of stopping here.

In standard Top-p (nucleus) sampling, the shortlist probabilities are renormalised so they sum to one, and the next word is drawn at random in proportion to them. Every word in the shortlist therefore has a genuine chance of being selected, including the less likely ones.

This random selection is where the potential for hallucination exists. The mechanism that creates diversity in LLM responses is the same mechanism that causes hallucinations.
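
Continuing the sketch from Step 5, the snippet below renormalises the shortlist probabilities and draws the next word at random in proportion to them. Again, this is a simplified illustration of the idea rather than any particular library’s implementation.

```python
import numpy as np

rng = np.random.default_rng()

def sample_next_word(shortlist, shortlist_probs):
    """Step 6: renormalise the shortlist probabilities so they sum to one,
    then draw the next word at random in proportion to them."""
    renormalised = np.asarray(shortlist_probs) / np.sum(shortlist_probs)
    return rng.choice(shortlist, p=renormalised)

# The shortlist produced by the previous sketch (threshold 0.9).
shortlist       = ["day", "night", "story", "dream"]
shortlist_probs = [0.40, 0.30, 0.15, 0.10]

print(sample_next_word(shortlist, shortlist_probs))  # usually 'day', occasionally 'dream'
```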

Shortlist Examples to Highlight Hallucinations

Including low-probability words in the shortlist can lead to hallucinations. Figure 5 shows three possible shortlists.


Figure 5: Three Shortlist Scenarios


Scenario 1: A single candidate

The probability of the first word equals or exceeds the threshold, so this single word is all we need. Single-word shortlists are more likely to occur for factual statements and everyday phrases such as:

  • The capital of England is … London
  • Once upon a … time

Lowering the threshold value increases the chance of producing a single candidate. With a lower threshold, the response will have less diversity and is less likely to hallucinate.

Scenario 2: Candidates with similar probabilities

In this scenario, all candidates in the shortlist have similar probabilities. This is useful for creative responses such as storytelling or ideation. Hallucinations can occur if the shortlist contains candidates that are not strongly related to the context of the prompt.

Scenario 3: Candidates with low probabilities

As we have seen, even though there is only one high-probability word, the shortlist can still contain several low-probability candidates, and the random selection in Top-p sampling may land on any of them.

A low-probability word may be acceptable if you are using an LLM to help you write a story. For fact-based responses, this scenario is less likely to yield credible answers.
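
To make the three scenarios concrete, the snippet below feeds three made-up probability distributions, shaped like the scenarios above, through the shortlist helper from the earlier sketch and prints the resulting shortlist sizes.

```python
import numpy as np

def top_p_shortlist(vocab, probs, threshold=0.9):
    # Same helper as in the earlier sketch: sort, accumulate, cut at the threshold.
    order = np.argsort(probs)[::-1]
    sorted_probs = np.asarray(probs)[order]
    cutoff = int(np.searchsorted(np.cumsum(sorted_probs), threshold)) + 1
    return [vocab[i] for i in order[:cutoff]], sorted_probs[:cutoff]

# Made-up distributions shaped like the three scenarios above.
scenarios = {
    "1 (one dominant word)":     [0.92, 0.03, 0.02, 0.02, 0.01],
    "2 (similar probabilities)": [0.24, 0.20, 0.19, 0.19, 0.18],
    "3 (one high, several low)": [0.60, 0.12, 0.11, 0.10, 0.07],
}

for name, probs in scenarios.items():
    vocab = [f"word_{i}" for i in range(len(probs))]
    shortlist, _ = top_p_shortlist(vocab, probs, threshold=0.9)
    print(f"Scenario {name}: shortlist size = {len(shortlist)}")
# Prints shortlist sizes of 1, 5, and 4 respectively.
```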

Can we Reduce Hallucinations?

We can do several things to reduce hallucinations, and it is an area of active research. The effort required for a particular approach may vary massively; examples include:

  • Reduce the Top-p threshold or the model’s response temperature.
  • Always choose the most probable word, at the risk of removing diversity.
  • Use Top-k sampling, where only the k most probable words are considered, regardless of their cumulative probability (a sketch of Top-k and Min-p filtering follows this list).
  • Integrate external information using RAG (Retrieval Augmented Generation).
  • Apply some form of verification after the LLM generates a response.
  • Carefully design prompts; COSTAR is a technique often used in prompt engineering.
  • Set a Min-p threshold so that the model only considers words with a probability greater than this threshold. Combining Min-p and Top-p is a common strategy.
  • Train a model using highly curated and validated data.
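
As a rough illustration of the Top-k and Min-p ideas mentioned in the list, here are two simple filtering helpers. They follow the plain descriptions given above (a fixed k, an absolute Min-p cutoff), so treat them as sketches rather than faithful reproductions of any particular library’s sampler.

```python
import numpy as np

def top_k_filter(vocab, probs, k=5):
    """Top-k: keep only the k most probable words, regardless of how much
    (or how little) probability they add up to."""
    order = np.argsort(probs)[::-1][:k]
    return [vocab[i] for i in order], np.asarray(probs)[order]

def min_p_filter(vocab, probs, min_p=0.05):
    """Min-p (as described in the list above): drop any word whose
    probability falls below the min_p threshold before sampling."""
    probs = np.asarray(probs)
    keep = probs >= min_p
    return [w for w, kept in zip(vocab, keep) if kept], probs[keep]

# Made-up probabilities for a tiny vocabulary.
vocab = ["time", "day", "dream", "night", "story", "cat"]
probs = [0.02, 0.45, 0.08, 0.30, 0.12, 0.03]

print(top_k_filter(vocab, probs, k=3)[0])         # ['day', 'night', 'story']
print(min_p_filter(vocab, probs, min_p=0.05)[0])  # ['day', 'dream', 'night', 'story']
```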

Final Thoughts

There is currently no method or combination of techniques that will completely eradicate hallucinations. After all, models work by predicting outputs rather than retrieving them. The steps you take to reduce hallucination will depend on your use case. Compare a model that provides medical advice with one that writes children’s stories.

Always consider the trade-off between diversity/creativity and the potential for hallucinations. Because, at least for now, you can’t have one without the other.

