High Fidelity Retrieval Augmented Generation (RAG) with Meta Llama 3.1 at PubNub

tl;dr: we combine Meta Llama 3.1, OpenParse, PG Vector, a >4K-dimension embedding model (NV-Embed-v2), human-written answers, and FTS (Full Text Search with keyword GIN indexing).

SOTA (State-Of-The-Art) refers to the best known approach available today, covering both the technology and the methods. In this case, it simply means methods that enhance information retrieval: we want to improve the accuracy of AI-generated responses. Better data means better AI responses. At PubNub, we're building Retrieval Augmented Generation (RAG) systems using the best combination of tech we've found: OpenParse, >4K-dimension embedding models, the Meta Llama 3.1 models, and PG Vector + PG FTS. This approach improves not only the precision of our query responses but also their relevance. Below, we walk through the tech used in our process.

Meta Llama 3.1 and OpenParse

There are many great LLMs to choose from, and Meta Llama 3.1 sits near the top of the list. The Meta Llama 3.1 family includes 8B, 70B, and 405B parameter models, known for strong language understanding and generation capabilities. The LLM is an important part of any RAG system. You can explore the models through Hugging Face's model repository: Meta Llama on Hugging Face.
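To make this concrete, here is a minimal sketch of loading the 8B Instruct model through Hugging Face transformers. It assumes you have accepted Meta's license for the gated meta-llama repository and authenticated with Hugging Face; the dtype and device settings are illustrative.

```python
# Minimal sketch: loading Meta Llama 3.1 8B Instruct with Hugging Face transformers.
# Assumes access to the gated meta-llama repo and an authenticated HF session.
import torch
import transformers

pipe = transformers.pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize Retrieval Augmented Generation in one line."},
]
out = pipe(messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])
```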

Document Conversion and Parsing

Our documents are primarily in HTML format, which we convert into PDFs, the format OpenParse works with best. OpenParse, an open-source Python library, handles document splitting. It segments documents while preserving their structure and context, an improvement over traditional naive text splitting methods. The library effectively captures formatting such as headings, sections, bullet points, and even tabular data, ensuring that the extracted segments are semantically meaningful.
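As a rough sketch of this stage: the article doesn't name a specific HTML-to-PDF converter, so WeasyPrint below is a stand-in assumption, followed by OpenParse's documented parsing flow.

```python
# Sketch of this stage. WeasyPrint is a stand-in for the HTML-to-PDF step
# (the converter isn't named above); OpenParse then splits the PDF into nodes.
import openparse
from weasyprint import HTML

HTML("docs/page.html").write_pdf("docs/page.pdf")  # HTML -> PDF

parser = openparse.DocumentParser()
parsed = parser.parse("docs/page.pdf")

for node in parsed.nodes:
    print(node.text)  # structure-aware segments: headings, bullets, tables, ...
```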

For more details on OpenParse, check their repository and PyPI package.

Creating Vector Embeddings

Once parsed, we generate vector embeddings using high-dimensional embedding models listed on the MTEB Leaderboard. These embeddings capture the semantic nuances of the text, making them essential for accurate information retrieval. A vector embedding works like a super-thesaurus: it connects related words and phrases and captures intent as well, going beyond simple keyword search.
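As a sketch of the embedding step, assuming the sentence-transformers loading path that NV-Embed-v2's model card describes (the model is large and uses custom code, so exact loading options may differ):

```python
# Illustrative sketch: embedding the parsed snippets with sentence-transformers.
# NV-Embed-v2 requires trust_remote_code; exact loading options may differ.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nvidia/NV-Embed-v2", trust_remote_code=True)

snippets = [node.text for node in parsed.nodes]  # from the OpenParse sketch above
embeddings = model.encode(snippets)              # one 4096-dim vector per snippet
```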

Dual Indexing: Vector and Keyword FTS

We use PG Vector, a PostgreSQL extension designed for vector indexing. It stores our high-dimensional embeddings and supports fast, accurate similarity search.
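Here is a minimal sketch of the storage and similarity-search side, using pgvector's psycopg integration. The snippets table schema is an illustrative assumption, not PubNub's actual schema. Note that pgvector's ANN indexes (ivfflat, hnsw) cap the indexable dimension count, so very high-dimensional embeddings like these may rely on exact scans or reduced dimensions:

```python
# Sketch: storing embeddings in PostgreSQL with pgvector, via the psycopg driver.
# The `snippets` schema here is illustrative. pgvector's ANN indexes cap the
# indexable dimension count, so 4096-dim vectors are searched exactly below.
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=rag", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)  # teaches psycopg to send/receive numpy vectors

conn.execute("""
    CREATE TABLE IF NOT EXISTS snippets (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(4096)
    )
""")

# `snippets` and `embeddings` come from the parsing and embedding sketches above.
for text, emb in zip(snippets, embeddings):
    conn.execute(
        "INSERT INTO snippets (content, embedding) VALUES (%s, %s)",
        (text, np.array(emb)),
    )

# Exact nearest-neighbor search by cosine distance (the <=> operator).
query_embedding = model.encode(["How do I publish a message?"])[0]
rows = conn.execute(
    "SELECT content FROM snippets ORDER BY embedding <=> %s LIMIT 5",
    (np.array(query_embedding),),
).fetchall()
```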

Keyword Full Text Search (FTS)

In addition to vector indexing, we implement keyword-based Full Text Search (FTS). This complements vector search by ensuring that exact matches and relevant contexts are also considered during retrieval. The snippets and documents are indexed using a GIN (Generalized Inverted Index), a straightforward keyword index. Supporting both index types in one database is a good reason to use PostgreSQL.
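Continuing the illustrative schema above, a GIN-backed FTS index and query might look like this sketch:

```python
# Sketch: a GIN-backed Full Text Search index over the same illustrative table.
# A generated tsvector column keeps the index in sync with `content` (PostgreSQL 12+).
conn.execute("""
    ALTER TABLE snippets
    ADD COLUMN IF NOT EXISTS content_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
""")
conn.execute(
    "CREATE INDEX IF NOT EXISTS snippets_fts ON snippets USING GIN (content_tsv)"
)

# Keyword query: websearch_to_tsquery parses plain user input into a tsquery.
rows = conn.execute(
    """
    SELECT content, ts_rank(content_tsv, q) AS rank
    FROM snippets, websearch_to_tsquery('english', %s) AS q
    WHERE content_tsv @@ q
    ORDER BY rank DESC
    LIMIT 5
    """,
    ("publish message channel",),
).fetchall()
```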

The Retrieval Process

When a user submits a query (asks a question), we run searches against both our vector and FTS indexes. The results from both searches are added to the user's question before feeding it into the Meta Llama model, which makes the generated response more accurate and contextually relevant. And because Llama 3.1 extended the context window (up to 128K tokens), we can include the full original documents alongside the snippets.
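Putting the pieces together, the retrieval step might look roughly like this sketch. vector_search and keyword_search are hypothetical helpers wrapping the pgvector and FTS queries shown earlier; model and pipe come from the embedding and Llama loading sketches:

```python
# Sketch of the full retrieval step. vector_search / keyword_search are
# hypothetical helpers wrapping the pgvector and FTS queries shown earlier;
# `model` and `pipe` come from the embedding and Llama loading sketches.
def answer(question: str) -> str:
    query_emb = model.encode([question])[0]
    vector_hits = vector_search(conn, query_emb, limit=5)    # list of snippet strings
    keyword_hits = keyword_search(conn, question, limit=5)   # list of snippet strings

    context = "\n\n".join(vector_hits + keyword_hits)
    messages = [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    out = pipe(messages, max_new_tokens=512)
    return out[0]["generated_text"][-1]["content"]
```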

You can simply combine the top search results from both FTS and vector search, or weight the two relevancy scores (for example, 50/50) to produce a single ranking. We prefer to include results from both and deduplicate any shared hits.
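A small sketch of that merge-and-deduplicate step:

```python
# Sketch: combine both result lists and deduplicate, keeping first-seen order.
def merge_results(vector_hits: list[str], keyword_hits: list[str]) -> list[str]:
    seen: set[str] = set()
    merged: list[str] = []
    for snippet in vector_hits + keyword_hits:
        if snippet not in seen:
            seen.add(snippet)
            merged.append(snippet)
    return merged
```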

Human Augmented

For frequently asked questions (FAQs), we incorporate human-written responses tagged as high-priority context. By prioritizing these crafted answers, the LLM can favor human expertise; we're essentially blending the two. This also lets us provide "example responses" the LLM can use as a pattern a human has already written.
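One simple way to express that priority, as a sketch (the tag strings are an illustrative assumption, not a fixed format):

```python
# Sketch: human-written FAQ answers go first, tagged so the model can favor them.
# The tag strings are an illustrative assumption, not a fixed format.
def build_context(faq_answers: list[str], retrieved: list[str]) -> str:
    parts = [f"[HUMAN-VERIFIED ANSWER]\n{a}" for a in faq_answers]
    parts += [f"[RETRIEVED SNIPPET]\n{s}" for s in retrieved]
    return "\n\n".join(parts)
```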

The RAG Approach:

  1. Enhanced Contextual Understanding: OpenParse's graphical parsing ensures that document segments retain their context, yielding better quality information.
  2. Embedding Models: Top-tier embedding models capture the semantic details of documents, improving search quality.
  3. Indexing: Combining vector and keyword FTS indexes lets us capture both semantic and exact matches, a more holistic retrieval mechanism.
  4. Prioritizing Human Expertise: Human-augmented responses let the system fill in the remaining accurate details.

At PubNub, our RAG system is built with OpenParse and the Meta Llama 3.1 models, a strong combination of the best tech available: graphical parsing, >4K-dimension embeddings, multi-indexing, and human responses. For those looking to implement a high-fidelity RAG system, consider these models and approaches.

Let us know

Do you see any improvements or additions we can make?
