This Week in AI #1: Scaling Q&A, Efficiency Trade-offs, and Increasing Credibility


Hey!

Welcome to the very first edition of This Week in AI: RAG Edition.

In this newsletter, we'll cover recent developments in the world of Gen AI every week, with a focus on RAG. We'll look into research papers that help us get better at using Gen AI to extract better answers from a large corpus and to power customer support, sales, and enterprise search in general.

We'll examine each paper in detail: what problem it solves, how the methodology works, and what improvements it delivers.

We're excited to share the latest insights with you!

Let’s kick off this inaugural edition with a sneak peek at three fascinating research papers we'll discuss today:

  • Pistis-RAG: A Scalable Cascading Framework Towards Content-Centric Retrieval-Augmented Generation
  • RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs
  • Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation

So let's start!


1. Pistis-RAG: A Scalable Cascading Framework Towards Content-Centric Retrieval-Augmented Generation


In Greek mythology, Pistis symbolized good faith, trust, and reliability. Drawing inspiration from these principles, Pistis-RAG is a scalable multi-stage framework designed to address the challenges of large-scale retrieval-augmented generation (RAG) systems.

What problem are they trying to solve?

Scaling RAG systems usually means compromising on either performance or efficiency. This study makes strides toward improving the quality and relevance of generated responses by optimizing the retrieval and ranking processes at scale. The key issues it tackles include:

  1. Prompt Order Sensitivity: Large Language Models (LLMs) can generate different outputs based on the order of the prompts they receive. This sensitivity can lead to incoherent or irrelevant responses if not managed properly.
  2. Efficiency of Information Retrieval: Ensuring that the most relevant information is retrieved quickly and accurately from large datasets.
  3. User Experience: Enhancing the coherence and relevance of the responses to improve overall user satisfaction.

Figure: the disparity between traditional Information Retrieval (IR) systems and RAG systems.


How are they solving the problem?

The paper introduces a comprehensive framework called Pistis-RAG, which integrates several key components to address these challenges:

1. Matching Service:

  • Vector Storage: Stores document representations as vectors to enable fast similarity comparisons with the user’s query vector.
  • Inverted Index: Allows rapid identification of documents containing specific keywords from the user’s query.
  • K-V Cache: Maintains user conversation context and stores frequently accessed prompt-answer pairs to enhance response times and personalization.
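To make these pieces concrete, here is a minimal sketch (our own illustration, not code from the paper) of how a matching service could combine a vector store, an inverted index, and a K-V cache to produce candidates; the class and helper names are assumptions.

```python
import numpy as np
from collections import defaultdict

class MatchingService:
    """Illustrative candidate generator: dense vectors + inverted index + K-V cache."""

    def __init__(self, docs, embed_fn):
        self.docs = docs
        self.embed_fn = embed_fn                                 # any text -> vector function
        self.doc_vecs = np.stack([embed_fn(d) for d in docs])    # vector storage
        self.inverted = defaultdict(set)                         # keyword -> doc ids
        for i, d in enumerate(docs):
            for tok in d.lower().split():
                self.inverted[tok].add(i)
        self.kv_cache = {}                                       # query -> cached candidate ids

    def match(self, query, k=10):
        if query in self.kv_cache:                               # cache hit: skip recomputation
            return self.kv_cache[query]
        # Keyword recall via the inverted index.
        keyword_ids = set().union(*(self.inverted.get(t, set())
                                    for t in query.lower().split()))
        # Dense recall via cosine similarity against the vector store.
        q = self.embed_fn(query)
        sims = self.doc_vecs @ q / (
            np.linalg.norm(self.doc_vecs, axis=1) * np.linalg.norm(q) + 1e-9)
        dense_ids = set(np.argsort(-sims)[:k].tolist())
        candidates = list(keyword_ids | dense_ids)               # union feeds the ranking stage
        self.kv_cache[query] = candidates
        return candidates
```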

2. Ranking Service:

  • Pre-Ranking: Filters and narrows down candidate documents before the main ranking process. This stage ensures that the most relevant few-shot examples are presented to the LLM first.
  • Full Ranking: Prioritizes items based on their relevance to the user’s query and intent. This stage addresses prompt order sensitivity by ensuring the most informative prompts are used first, leading to more coherent and accurate responses.
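As a rough illustration of how full ranking can counter prompt order sensitivity, the snippet below (again our own sketch, not the authors' code) orders few-shot examples by a relevance score before assembling the prompt, so the most informative examples come first. `score_fn` is a placeholder for any relevance scorer.

```python
def assemble_prompt(query, examples, score_fn, top_n=5):
    """Order few-shot examples by relevance so the most informative appear first."""
    ranked = sorted(examples, key=lambda ex: score_fn(query, ex), reverse=True)
    shots = "\n\n".join(f"Q: {ex['question']}\nA: {ex['answer']}"
                        for ex in ranked[:top_n])
    return f"{shots}\n\nQ: {query}\nA:"
```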

3. Experimental Setup:

  • Data Preparation: The MMLU dataset, containing 15,908 questions, is used for evaluation. This dataset is indexed and used to retrieve and rank candidate few-shots for generation.
  • Model Components:

- BGE-M3: Retrieves the top 10 candidate few-shots.

- BGE-reranker-large: Pre-ranks the candidate few-shots to select the top five.

- Llama-2-13B-chat: Generates the final answer from the selected few-shot prompts.
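Here's a hedged sketch of how that retrieve, pre-rank, generate pipeline could be wired up with off-the-shelf libraries. It assumes the BGE checkpoints are loaded through sentence-transformers and the Llama-2 chat model through transformers; this is our reconstruction, not the authors' code.

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from transformers import pipeline

retriever = SentenceTransformer("BAAI/bge-m3")                  # dense retriever
reranker = CrossEncoder("BAAI/bge-reranker-large")              # pre-ranking model
generator = pipeline("text-generation", model="meta-llama/Llama-2-13b-chat-hf")

def answer(query, candidate_shots):
    # Step 1: retrieve the top-10 candidate few-shot examples by dense similarity.
    q_emb = retriever.encode(query, convert_to_tensor=True)
    shot_embs = retriever.encode(candidate_shots, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, shot_embs, top_k=10)[0]
    top10 = [candidate_shots[h["corpus_id"]] for h in hits]

    # Step 2: pre-rank with the cross-encoder and keep the top five.
    scores = reranker.predict([(query, s) for s in top10])
    top5 = [s for _, s in sorted(zip(scores, top10), reverse=True)[:5]]

    # Step 3: generate the answer conditioned on the selected few-shots.
    prompt = "\n\n".join(top5) + f"\n\nQuestion: {query}\nAnswer:"
    return generator(prompt, max_new_tokens=128)[0]["generated_text"]
```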

Figure: overview of the Pistis-RAG framework.

What are the results?

Here are experimental results demonstrating the effectiveness of the Pistis-RAG framework. Key performance metrics used include precision, recall, and F1-score.

  • Baseline Model: F1-score of 50.0%
  • Without Ranking Stage: F1-score of 52.3%
  • Without Reasoning and Aggregating Stage: F1-score of 52.8%
  • Full Pistis-RAG: F1-score of 54.65%

What's next?

Pistis-RAG demonstrates significant improvements in the performance of RAG systems, especially in handling prompt order sensitivity and improving information retrieval efficiency. The comprehensive framework can be used by enterprises to ensure more coherent and relevant responses, reducing frustrations and enhancing overall user satisfaction.



2. RankRAG: Unifying Context Ranking with Retrieval-Augmented Generation in LLMs

"In this work, we propose a novel instruction fine-tuning framework RankRAG, which instruction-tunes a single LLM for the dual purpose of context ranking and answer generation in RAG."


What problem are they trying to solve?

There's always a trade-off between efficiency and performance in retrieval-augmented generation (RAG) systems. Even accepting this, there's still room for optimization.

In this paper, the researchers address key limitations in current Retrieval-Augmented Generation (RAG) systems used by large language models (LLMs), looking for ways to minimize performance losses while improving efficiency in model training and inference, particularly in the context selection and answer generation stages of RAG pipelines.

  1. Efficiency and Accuracy Trade-off: Traditional RAG systems struggle to read numerous context chunks (e.g., top-100) effectively, and while using fewer contexts (e.g., top-5 or top-10) can improve generation accuracy, it risks missing relevant information.
  2. Retrieval Limitations: Relying solely on dense embedding-based retrieval models often falls short in ensuring high recall of relevant content due to ineffective local alignments across the embedding space.
  3. Limited Zero-shot Generalization: Existing expert ranking models, separate from the LLMs, are not versatile and struggle with generalization across different tasks and domains.


How are they solving the problem?

Previous RAG approaches often suffered from two main issues: inefficient context selection and the need for separate retrieval and ranking models. RankRAG addresses both through a single instruction fine-tuning framework.

RankRAG is a novel instruction fine-tuning framework designed to integrate context ranking and answer generation within a single LLM.

RankRAG uses a clever trick of casting all tasks into a standardized (question, context, answer) format. This allows the model to transfer knowledge between ranking and generation tasks effectively.
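To illustrate what a shared (question, context, answer) format could look like in practice, here is a simplified sketch based on our reading; the template and examples are ours, not the paper's exact prompts.

```python
def to_qca(question, context, answer):
    """Render any task as a (question, context, answer) instruction example."""
    return f"Question: {question}\nContext: {context}\nAnswer: {answer}"

# A QA example: the answer is generated from the retrieved context.
qa_example = to_qca(
    question="Who wrote 'Pride and Prejudice'?",
    context="Pride and Prejudice is an 1813 novel by Jane Austen.",
    answer="Jane Austen",
)

# A ranking example: the 'answer' is a relevance judgment for the same context,
# so one model can learn both ranking and generation from one format.
rank_example = to_qca(
    question="Is the context relevant to: Who wrote 'Pride and Prejudice'?",
    context="Pride and Prejudice is an 1813 novel by Jane Austen.",
    answer="True",
)
```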

During inference, RankRAG first reranks a set of retrieved contexts, then generates the final answer using only the most relevant ones. This approach acts like "attention sinks" for RAG, helping the model maintain focus on the most important information without losing overall context. The results show that even with a relatively small amount of ranking data (just 1% of a standard dataset), RankRAG can significantly outperform dedicated ranking models and larger language models on various knowledge-intensive tasks.

Experimental setup:

  • Training Data: A blend of context ranking data is integrated into the instruction-tuning data.
  • Inference Mechanism: The model first reranks the retrieved contexts and then uses the refined top-k contexts to generate the answer.
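Putting the two bullets together, the inference loop might look roughly like the sketch below, where `retriever`, `llm_relevance_score`, and `llm_generate` are hypothetical wrappers around the retriever and the single instruction-tuned LLM (our illustration, not the released implementation).

```python
def rankrag_answer(question, retriever, llm_relevance_score, llm_generate,
                   n_retrieve=100, k_keep=5):
    # Step 1: cast a wide net with the retriever (e.g., the top-100 contexts).
    contexts = retriever(question, top_k=n_retrieve)

    # Step 2: the same LLM scores each (question, context) pair for relevance.
    scored = [(llm_relevance_score(question, c), c) for c in contexts]
    top_contexts = [c for _, c in
                    sorted(scored, key=lambda x: x[0], reverse=True)[:k_keep]]

    # Step 3: the LLM answers using only the refined top-k contexts.
    return llm_generate(question, top_contexts)
```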

What are the results?

  • General-Domain Benchmarks: Llama3-RankRAG models (8B and 70B) significantly outperform the Llama3-ChatQA-1.5 models on nine general-domain RAG benchmarks.
  • Biomedical Domain Benchmarks: Without instruction fine-tuning on biomedical data, Llama3-RankRAG models perform comparably to GPT-4 on five RAG benchmarks in the biomedical domain, demonstrating strong generalization capabilities.

  • Llama3-RankRAG vs. Llama3-ChatQA-1.5:

- NQ (Natural Questions): Exact Match (EM) improvements of several percentage points.

- TriviaQA: EM improvements of 2-4%.

- PopQA: EM improvements by a substantial margin.

- FEVER: EM improvements of 1-2%.



3. Ground Every Sentence: Improving Retrieval-Augmented LLMs with Interleaved Reference-Claim Generation

"We propose a method called RECLAIM, which alternately generates citations and answer sentences, to enable large models to generate answers with citations."

What problem are they trying to solve?

You type a question, and an LLM generates an answer, complete with bold claims. How do you know it's the truth?

This research paper tackles two main issues with Gen AI Q&A systems:

  1. Improving the Quality of Answers and Citations: Ensuring that the answers generated by models are fluent, correct, and well-cited. The challenge is to enhance the quality of both answers and the citations that support them.
  2. Enhancing Verifiability and Credibility: Improving the verifiability and credibility of answers generated by Retrieval-Augmented Generation (RAG)-based question-answering systems. This involves ensuring that the information generated by the model can be traced back to credible sources and is accurate.


How are they solving the problem?

The paper introduces a method called RECLAIM (Reference-Claim Interleaving Model) to solve these problems. The approach involves several key steps:

Step 1: Task Formalization: The task is defined as generating an output consisting of fine-grained references and claims. Given a query q and several reference passages D, the model generates an output O that alternates between references (r1, r2, ..., rn) and claims (c1, c2, ..., cn). Each reference substantiates the corresponding claim, and together they form a complete, coherent answer.
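As a small structural sketch (our own, with illustrative names), the interleaved output can be viewed as alternating reference-claim pairs that concatenate into the final cited answer.

```python
from dataclasses import dataclass

@dataclass
class ReferenceClaimPair:
    reference: str   # r_i: span drawn from the retrieved passages D
    claim: str       # c_i: answer sentence supported by r_i

def assemble_answer(pairs):
    """Interleave [r_1, c_1, ..., r_n, c_n]; the claims alone form the answer."""
    cited = "\n".join(f"[{p.reference}] {p.claim}" for p in pairs)
    answer_only = " ".join(p.claim for p in pairs)
    return cited, answer_only
```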

Step 2: Training Dataset Construction:

The initial dataset was WebGLM-QA, with 43,579 samples of rich references and detailed answers. The dataset was further segmented to identify relevant citations, and NLI methods were used to ensure attribution quality.

The data was then filtered to remove mismatched citations, leaving a refined dataset of 9,433 samples. This ensured text consistency and high-quality attributions.
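The NLI filtering step could be approximated with an off-the-shelf entailment model along these lines. This is a sketch assuming a cross-encoder NLI checkpoint such as cross-encoder/nli-deberta-v3-base, which is not necessarily the model the authors used.

```python
from sentence_transformers import CrossEncoder

# For this checkpoint the output labels are (contradiction, entailment, neutral),
# per its model card; adjust the index if a different NLI model is used.
nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def keep_pair(reference, claim, threshold=0.9):
    """Keep a (reference, claim) pair only if the reference entails the claim."""
    probs = nli.predict([(reference, claim)], apply_softmax=True)[0]
    return probs[1] >= threshold          # index 1 = entailment probability

# Toy sample in the (references, claims) shape assumed by this sketch.
samples = [{"references": ["The Eiffel Tower is 330 metres tall."],
            "claims": ["The Eiffel Tower stands about 330 metres high."]}]

filtered = [s for s in samples
            if all(keep_pair(r, c)
                   for r, c in zip(s["references"], s["claims"]))]
```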

Step 3: Model Training:

Two models, ReferModel and ClaimModel, are trained separately:

For the ReferModel, the researchers performed full fine-tuning with a learning rate of 2e-5 over 3 epochs. For the ClaimModel, they employed LoRA tuning (Hu et al., 2021) with a learning rate of 5e-5 over 5 epochs.
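For reference, a LoRA setup like the one described for the ClaimModel is typically expressed with the peft library along these lines. The learning rate and number of epochs come from the paper as quoted above; the base model, LoRA rank, and target modules are illustrative assumptions, not the authors' exact configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Base model is an assumption; the results section mentions Llama3-8B-Instruct.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,      # illustrative values
    target_modules=["q_proj", "v_proj"],         # common choice for Llama-style models
    task_type="CAUSAL_LM",
)
claim_model = get_peft_model(base, lora_cfg)

training_args = TrainingArguments(
    output_dir="claim_model",
    learning_rate=5e-5,              # ClaimModel: LoRA tuning at 5e-5 for 5 epochs
    num_train_epochs=5,
    per_device_train_batch_size=4,   # illustrative
)
# The ReferModel would instead be fully fine-tuned at 2e-5 for 3 epochs.
```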

Figure: the ReferModel and ClaimModel.


Step 4: Evaluation and Comparison:

The approach is evaluated against existing methods like ALCE. Metrics such as fluency, correctness, and citation accuracy are used to measure performance.

What are the results?

The RECLAIM methodology shows improvements in various metrics.

  • The RECLAIMUnified method increased the fluency of model responses by 15.6% on average, with a minimal impact on correctness (1.6% decrease). It also reduced citation quality by 23.4%.
  • The RECLAIM w/IG method, using the Llama3-8B-Instruct model, outperformed the ALCE method with ChatGPT, showing a 23.7% improvement in fluency, a 30.0% increase in citation accuracy score (CAS), and a 7.9% boost in citation relevance score (CRS), though correctness decreased by 4.0%.

Overall, RECLAIM w/IG achieved the best performance in fluency and citation quality, while the claim-only method had the highest correctness score of 37.8 but at the cost of increased response length and reduced fluency.




Enjoying the content?

Make sure to subscribe for your weekly dose of AI insights, and don’t forget to share in the comments if there's a particular aspect of any research paper you’d like us to explore further.

What’s the biggest question you have about RAG and Gen AI? Let us know in the comments. We read each one and may just feature yours in a future newsletter!

Happy reading, and see you next week!


By Alltius
