Towards Advanced RAG
Retrieval Augmented Generation (RAG) is the application of information retrieval techniques to generative models, such as Large Language Models (LLMs), to produce relevant and grounded responses, conditioned on some external dataset or knowledge base.
In layman's terms, RAG enables generative models to source answers quickly and accurately from large datasets. We firmly believe that an AI agent, like a teammate, is only as good as the training and knowledge-sharing it receives. As such, a world-class RAG implementation is key to our vision of enabling every business to employ an AI agent one day.
Some key challenges with LLMs that are addressed by RAG are:
However, it is not a silver bullet and requires careful consideration in terms of its components and architecture. For example, it is sensitive to things like:
Furthermore, there is a need for rigorous evaluation to ensure factuality and grounding in the provided data sources.
Improving our RAG Architecture
When I joined Relevance AI, my first major task was to improve our existing RAG system: to make it more accurate, easier to iterate on, more auditable and, to a degree, future-proof.
Key Hypotheses
To break down the task into achievable goals I developed 5 key hypotheses.
Formulation
RAG generates an output sequence Y = (y_1, y_2, ..., y_M) token-by-token given an input sequence X = (x_1, x_2, ..., x_N) and a set of retrieved documents D = (d_1, d_2, ..., d_K).
The probability of generating Y is calculated as:
Where y_{<i} = (y_1, y_2, ..., y_{i-1}) are the previously generated tokens.
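The equation itself is missing from this copy of the text. Assuming the standard autoregressive factorization, which is consistent with the definitions of X, Y, D and y_{<i} given here, it can be written as:

```latex
P(Y \mid X, D) = \prod_{i=1}^{M} P\left(y_i \mid y_{<i}, X, D\right)
```

Each factor is the probability of the next token given the tokens generated so far, the input sequence and the retrieved documents.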
The conditional probability is modeled by the following:
Some key components:
Additionally, techniques like query manipulation, retrieval reranking, and the selection, enhancement and customization of retrieved documents can improve performance.
Metrics
We evaluate the retrieval and generation stages of our RAG pipeline separately using the metrics discussed below.
Note: We use the following terms interchangeably: contexts ↔ ground truth chunks (unless specified as retrieved) ↔ top-k chunks ↔ number of retrieved chunks.
Retrieval
Where k is the (i+1)th chunk, Nr is the number of relevant chunks retrieved, Nrk is the number of relevant chunks up to k, and rk is a binary indicator for chunks that are relevant vs not.
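The formula this sentence annotates is missing here. Its symbols match average precision over a ranked list, so the sketch below assumes that metric. The function name and the input format (a list of 0/1 relevance flags in rank order) are illustrative choices, not taken from the original.

```python
from typing import Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """Average precision over a ranked list of retrieved chunks.

    relevance[i] is rk for rank k = i + 1: 1 if that chunk is relevant, else 0.
    """
    n_relevant = sum(relevance)  # Nr: relevant chunks among those retrieved
    if n_relevant == 0:
        return 0.0
    hits_so_far = 0  # Nrk: relevant chunks seen up to rank k
    total = 0.0
    for i, r_k in enumerate(relevance):
        k = i + 1
        hits_so_far += r_k
        # Precision at rank k, counted only when the chunk at k is relevant.
        total += (hits_so_far / k) * r_k
    return total / n_relevant


if __name__ == "__main__":
    print(average_precision([1, 0, 1]))
```

Averaging this value over all queries in a test set would give mean average precision, again under the assumption that this is the metric intended.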
Where Na is the number of attributable statements as extracted and classified by the evaluator LLM, and N is the total number of statements.
Generation
Where N is the total number of statements and Nr is the number of relevant statements. An LLM is used to extract the statements from the response and then classify whether each statement is relevant to the query.
Where Nt is the number of truthful claims and N is the total number of claims. The evaluator extracts all claims made in the response and then classifies which are truthful based on the facts in the retrieved context.
Where Nc is the number of contradicted contexts and N is the total number of ground truth chunks or contexts.
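The generation metrics described above each read as a count of flagged items over a total. A minimal sketch, assuming each score is that simple ratio, is shown below. The metric names, the helper name and the example verdicts are hypothetical; in practice the boolean flags would come from the evaluator LLM.

```python
from typing import Sequence


def fraction_flagged(flags: Sequence[bool]) -> float:
    """Return the share of items the evaluator flagged, or 0.0 for an empty list."""
    if not flags:
        return 0.0
    return sum(flags) / len(flags)


# Hypothetical evaluator verdicts, one boolean per extracted item.
relevant_statements = [True, True, False, True]  # Nr = 3, N = 4
truthful_claims = [True, False]                  # Nt = 1, N = 2
contradicted_contexts = [False, False, True]     # Nc = 1, N = 3

# Metric names are assumptions; the originals are not given in the text.
answer_relevancy = fraction_flagged(relevant_statements)
faithfulness = fraction_flagged(truthful_claims)
hallucination = fraction_flagged(contradicted_contexts)
```

The first two scores are better when higher, while the last one is better when lower.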
Implementation Details
Synthetic Data Generation
Dataset for RAG
To assess performance accurately, I needed to create synthetic test datasets that closely model the real-world data generation process. For a RAG system, this comes down to creating a dataset that contains the gold labels:
where (q, a, c) represent a single test example triple of query, answer (or ground truth) and contexts respectively. Query is the user question, ground truth is the factual or expected response from the system and contexts are the verbatim references in the text.
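One way to hold such a triple in code is sketched below. The class name, field names and the sample document are made up for illustration. The check relies only on the statement that contexts are verbatim references in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GoldExample:
    """A single (q, a, c) test example."""

    query: str           # q: the user question
    ground_truth: str    # a: the factual or expected response
    contexts: List[str]  # c: verbatim references from the source text


def is_grounded(example: GoldExample, document_text: str) -> bool:
    """Check that every context is a non-empty verbatim span of the document."""
    return bool(example.contexts) and all(
        bool(c) and c in document_text for c in example.contexts
    )


# Made-up sample data, for illustration only.
document_text = "Refunds are issued within 14 days. Shipping is free over $50."
example = GoldExample(
    query="How long do refunds take?",
    ground_truth="Refunds are issued within 14 days.",
    contexts=["Refunds are issued within 14 days."],
)
```

A check like `is_grounded` can filter out generated examples whose contexts were paraphrased rather than quoted.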
There are many frameworks that allow for synthetic dataset generation, such as LlamaIndex, RAGAS and DeepEval, each with its own set of parameters.
However, as we’ll discuss below, these approaches often require pre-chunked text, so the quality of the test set becomes highly dependent on the chunking method. This also raises a more fundamental issue about how humans parse a document and think of questions to ask. The relevant passages are usually not all co-located within a single arbitrary chunk; rather, a question relates to a more flexible, conceptual understanding of the document, where the derived contexts may or may not be co-located.
Pipeline for synthetic corpus creation
We propose using vision as the modality of choice for generating a test set.
Steps:
Ingestion
Parser
During ingestion, the first stage is to extract information effectively from various document types and formats. Extracting text and images varies considerably between formats, given their often incompatible and non-interoperable native encodings and standards. Broadly,
Initiatives like LlamaParse and the directory reader from LlamaIndex (a high-level abstraction over file readers such as PyMuPDF) seek to address this. However, results for document types in the latter category are often sub-optimal because the relative structure between components is not preserved, which makes retrieval poor. Naive PDF readers often break the hierarchy while extracting text, for example:
Customer Data Focus
Product Data Focus
Strengths
Feedback-driven development
Privacy first
Fast iteration cycles
Rapid prototyping ..
In the next section, we address this shortcoming with an agentic framework.
Agent-based router
I started by first defining a taxonomy of document types and their formats. Below is a rough overview of this taxonomy.
In this two-fold framework, the agent first:
## Customer Data Focus
### Strengths
- Feedback-driven development
- Privacy first
# Product Data Focus
## Strengths
- Fast iteration cycles
- Rapid prototyping
..
Similar to the synthetic data creation pipeline, we use a multimodal LLM as the classifier, feeding it the first p ≤ P pages, where P is the total number of pages (p = 3 by default).
The pipeline is document agnostic and further allows the user to “route” the query through different RAG strategies that best fit a document type/sub-type as determined through a grid search.
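The classify-then-route flow can be sketched as follows. This is a sketch under stated assumptions: the classifier is a toy stand-in for the multimodal LLM, and the document types and strategy names are hypothetical. Only the page sampling (first p pages, p = 3 by default, capped at P) comes from the description above.

```python
from typing import Callable, Dict, List, Sequence


def first_pages(pages: Sequence[str], p: int = 3) -> List[str]:
    """Return the first p pages, capped at the document length P so that p <= P."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return list(pages[: min(p, len(pages))])


def route(
    pages: Sequence[str],
    classify: Callable[[List[str]], str],
    strategies: Dict[str, str],
    default: str = "base",
) -> str:
    """Classify a document from its first pages and look up its RAG strategy."""
    doc_type = classify(first_pages(pages))
    return strategies.get(doc_type, default)


# Hypothetical stand-in for the multimodal LLM classifier.
def toy_classifier(sample: List[str]) -> str:
    return "slides" if any("Agenda" in page for page in sample) else "report"


# Hypothetical document types mapped to hypothetical strategy names.
strategies = {"slides": "hierarchy_aware", "report": "base"}
```

In a real pipeline the mapping from document type to strategy would be filled in from the grid search results rather than written by hand.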
Embedding
Based on the Massive Text Embedding Benchmark (MTEB) [2], and aiming for a good balance between size and performance, we select the state-of-the-art BGE family of open-source models published by BAAI [3]. Their smaller model bge-small-en-v1.5 (with 384 dimensions and 512 sequence length) in particular has high throughput and runs well even on low to mid-end consumer hardware. For asynchronous tasks with agentic loops, we switch to the larger bge-large-en-v1.5 (with 1024 dimensions and 512 sequence length).
Additionally, for closed-source models, we use OpenAI’s text-embedding-3-large and Cohere’s embed-english-v3.0, both of which rank highly in benchmarks.
On the closed source front we observe text-embedding-3-large consistently outperforming other proprietary models. We report the results in the sections below.
Indexing
For indexing and storing vectors, we leverage Qdrant DB. Their HNSW indexing algorithm provides robust performance for approximate nearest neighbour (ANN) searches with extremely low latency.
Chunking
We adopt the following strategies for chunking:
Retrieval
For our retrieval strategy, we use the following:
Reranker
Post-retrieval reranking allows for the chunks to be essentially reordered using a different model trained to compute relevance given some query and search results (chunks in our case). We try the following:
Generation
Finally, we choose the following LLMs to be the response generator using the retrieved chunks to answer the given query:
Key Results
Preliminary experiments: Basic RAG
Hyper-parameter Optimization
In order to determine the best combination of embedding, chunking and retrieval strategies, we perform a grid search over the following hyper-parameters:
--> Base: Uses a basic sentence splitter for chunking along with additional sanitization and cleaning
--> OpenAI's text-embedding-3-large
--> Cohere's embed-english-v3.0
--> Vector search
--> Keyword search (BM25, the Best Matching 25 ranking function)
--> Hybrid (keyword + vector search)
--> OpenAI's gpt-3.5-turbo and gpt-4-turbo-preview
--> Mistral's mistral-medium
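A grid search over these options amounts to scoring every combination. The sketch below enumerates the grid with a Cartesian product. The grouping of the options into four dimensions is inferred from the section headings, and the dictionary keys, function names and `evaluate` callback are hypothetical.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Config = Dict[str, str]

# Dimension names are assumptions; the option values come from the list above.
search_space: Dict[str, List[str]] = {
    "chunking": ["base"],
    "embedding": ["text-embedding-3-large", "embed-english-v3.0"],
    "retrieval": ["vector", "keyword", "hybrid"],
    "generator": ["gpt-3.5-turbo", "gpt-4-turbo-preview", "mistral-medium"],
}


def grid(space: Dict[str, List[str]]) -> List[Config]:
    """Enumerate every combination of options as a list of config dicts."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def best_config(
    space: Dict[str, List[str]], evaluate: Callable[[Config], float]
) -> Tuple[Config, float]:
    """Score every config with the supplied function and return the best pair."""
    scored = [(config, evaluate(config)) for config in grid(space)]
    return max(scored, key=lambda pair: pair[1])
```

Here `evaluate` stands for whatever runs the pipeline on the test set and returns a single score to maximize. With the options listed, the grid contains 1 × 2 × 3 × 3 = 18 configurations.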
Below are some of the results using the closed-source (highest performing) models:
Observations
Note: For this experiment we do not use the reranking step to get a baseline performance.
Legacy vs Advanced RAG v1.0
Next, we compare the performance, especially hallucination rates, of our legacy system and advanced RAG on a synthetic dataset.
The legacy retrieval had the following specifications:
For advanced retrieval:
Observations
Note: Larger score is better for all metrics except hallucination; measured @k=3
Enabling the AI Workforce
It worked! Our Advanced RAG system is better on every measure than our legacy RAG system.
Expect this to improve over time as we continue to iterate on it. It’s now enabled for all Relevance AI customers. Sign up now and try it out!
Learnings
Some reflections on what worked and what did not.
References