High Fidelity Retrieval Augmented Generation (RAG) with Meta Llama 3.1 at PubNub
tl;dr: we combine Meta Llama 3.1, OpenParse, PG Vector, a >4K-dimensional embedding model (NV-Embed-v2), human-written answers, and Full Text Search (FTS) with keyword GIN indexing.
SOTA (state-of-the-art) describes the best known approach at a given time, in both technology and methods. In this case, it is simply a set of methods to enhance information retrieval, because we want to improve the accuracy of AI-generated responses: better data, for better AI responses. At PubNub, we're building Retrieval Augmented Generation (RAG) systems using the best combination of technologies we've found: OpenParse, 4K-dimensional embedding models, the Meta Llama 3.1 models, and PG Vector plus PostgreSQL FTS. This approach improves not only the precision of our query responses but also their relevance. Below, we walk through the technology used in our process.
Meta Llama 3.1 and OpenParse
There are many great LLMs to choose from, and Meta Llama 3.1 sits near the top of the list. It ships in 8B, 70B, and 405B parameter sizes, all known for high-performing language understanding and generation, which makes it an important part of the RAG system. You can explore the models through Hugging Face's model repository: Meta Llama on Hugging Face.
Document Conversion and Parsing
Our documents are primarily in HTML format, which we convert into PDFs because OpenParse works best with PDF input. OpenParse, an open-source Python library, handles document splitting: it segments documents while preserving their structure and context, an improvement over traditional naive text-splitting methods. The library effectively captures formatting such as headings, sections, bullet points, and even tabular data, ensuring that the extracted segments are semantically meaningful.
For more details on OpenParse, you can check their repository and PyPi package.
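As a rough sketch, the parsing step looks like the following (the file path is illustrative, and node handling is simplified from our pipeline):

```python
import openparse

# Parse a PDF converted from our HTML documentation.
parser = openparse.DocumentParser()
parsed_doc = parser.parse("./docs/pubnub-sdk-guide.pdf")

# Each node preserves structure (headings, bullets, tables) instead of
# being an arbitrary fixed-size chunk of text.
for node in parsed_doc.nodes:
    print(node.text)
```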
Creating Vector Embeddings
Once parsed, we generate vector embeddings using the high-dimensional embedding models listed on the MTEB Leaderboard. These embeddings capture the semantic nuances of the text, making them essential for accurate information retrieval. A vector embedding works a bit like a super-thesaurus: it connects words and phrases in a way that captures intent, going well beyond plain keyword search.
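As a minimal sketch, assuming the NV-Embed-v2 model from the tl;dr is loaded through sentence-transformers (exact loading options, such as instruction prefixes, vary by model):

```python
from sentence_transformers import SentenceTransformer

# NV-Embed-v2 produces 4096-dimensional embeddings (the ">4K" in the tl;dr).
# trust_remote_code is required because the model ships custom modeling code.
model = SentenceTransformer("nvidia/NV-Embed-v2", trust_remote_code=True)

# Embed the OpenParse snippets from the previous step.
snippets = [node.text for node in parsed_doc.nodes]
embeddings = model.encode(snippets)  # shape: (len(snippets), 4096)
```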
Dual Indexing: Vector and Keyword FTS
We use PG Vector, a PostgreSQL extension designed for vector indexing. It stores our high-dimensional embeddings and supports fast, accurate similarity search.
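Schematically, storing and querying the embeddings looks like this (the table name and connection string are hypothetical; since pgvector's approximate indexes top out below 4096 dimensions, the sketch uses exact search):

```python
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("dbname=rag")  # hypothetical connection string
cur = conn.cursor()
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()
register_vector(conn)  # lets us pass numpy arrays directly

cur.execute("""
    CREATE TABLE IF NOT EXISTS snippets (
        id bigserial PRIMARY KEY,
        content text NOT NULL,
        embedding vector(4096)  -- matches NV-Embed-v2's output size
    );
""")
for text, emb in zip(snippets, embeddings):
    cur.execute("INSERT INTO snippets (content, embedding) VALUES (%s, %s);",
                (text, emb))
conn.commit()

# Cosine-distance nearest neighbors for an embedded user question.
query_emb = model.encode(["How do I publish a message?"])[0]
cur.execute("SELECT content FROM snippets ORDER BY embedding <=> %s LIMIT 5;",
            (query_emb,))
vector_hits = [row[0] for row in cur.fetchall()]
```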
Keyword Full Text Search (FTS)
In addition to vector indexing, we implement keyword-based Full Text Search (FTS). This method complements vector search by ensuring that exact matches and relevant contexts are also considered during retrieval. The snippets and documents are indexed with a GIN (Generalized Inverted Index), a straightforward keyword index. Supporting both index types in one database is a good reason to use PostgreSQL.
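Continuing the sketch above, the GIN index can live on the same table (column and index names are our own):

```python
# Maintain a tsvector column automatically and index it with GIN.
cur.execute("""
    ALTER TABLE snippets
        ADD COLUMN IF NOT EXISTS content_tsv tsvector
        GENERATED ALWAYS AS (to_tsvector('english', content)) STORED;
""")
cur.execute("""
    CREATE INDEX IF NOT EXISTS snippets_fts_idx
        ON snippets USING GIN (content_tsv);
""")
conn.commit()

# Keyword search, ranked by relevance.
cur.execute("""
    SELECT content
    FROM snippets, websearch_to_tsquery('english', %s) AS q
    WHERE content_tsv @@ q
    ORDER BY ts_rank(content_tsv, q) DESC
    LIMIT 5;
""", ("publish message",))
fts_hits = [row[0] for row in cur.fetchall()]
```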
The Retrieval Process
When a user submits a query (asks a question), we run searches against both our vector and FTS indexes. The results from both searches are added to the user's question before the combined prompt is fed into the Meta Llama model, producing a response that is more accurate and contextually relevant. And because Llama 3.1 extended the context window to 128K tokens, we can include the full original documents alongside the snippets.
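A simplified sketch of that final step, here through Hugging Face transformers (the prompt template is illustrative, and in production you would use the model's chat template and an inference server):

```python
from transformers import pipeline

# The smallest Llama 3.1 instruct model; the 70B and 405B variants drop in.
generator = pipeline("text-generation",
                     model="meta-llama/Llama-3.1-8B-Instruct")

question = "How do I publish a message?"
context = "\n\n".join(vector_hits + fts_hits)  # results from both indexes
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {question}\nAnswer:"
)
answer = generator(prompt, max_new_tokens=300)[0]["generated_text"]
```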
You can either combine the top search responses from both FTS and vector search, or weight the relevance ranking 50/50 between the two. We like to include results from both and deduplicate the shared hits.
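The merge itself can be as simple as an order-preserving interleave that drops duplicates; the helper below is our own illustration of that approach:

```python
from itertools import zip_longest

def merge_results(vector_hits, fts_hits):
    """Interleave vector and FTS results, dropping duplicates while
    preserving each list's internal ranking."""
    merged, seen = [], set()
    for pair in zip_longest(vector_hits, fts_hits):
        for snippet in pair:
            if snippet is not None and snippet not in seen:
                seen.add(snippet)
                merged.append(snippet)
    return merged
```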
Human Augmented
For frequently asked questions (FAQs), we incorporate human-written responses carrying high-priority context tags. By prioritizing these crafted answers, the LLM can favor human expertise; we are essentially blending the two sources. This approach also lets us provide “example responses” to the LLM so it can follow the pattern a human has written.
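One way to express that priority is to place the human-written answers first in the context under an explicit tag; the tag convention below is our own, not something the model requires:

```python
def build_context(human_answers, retrieved_snippets):
    """Put human-written FAQ answers ahead of retrieved snippets, tagged so
    the prompt can tell the model to favor and imitate them."""
    parts = [f'<human_answer priority="high">\n{a}\n</human_answer>'
             for a in human_answers]
    parts += [f"<snippet>\n{s}\n</snippet>" for s in retrieved_snippets]
    return "\n\n".join(parts)
```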
The RAG Approach
At PubNub, our RAG system is built with OpenParse and the Meta Llama 3.1 models, a strong combination of the best technology available: layout-aware parsing, 4K-dimensional embeddings, dual indexing, and human-written responses. For those looking to implement a high-fidelity RAG system, consider these models and approaches.
Let us know
Do you see any improvements or additions we can make?
SDEIII at Amazon
Build a knowledge graph with data ingested under clean schemas, hand the DDL to the LLM, and have it write queries like a human would instead of doing the embeddings deal. The closer I push LLMs toward doing what I would do, the better the results I get. Clean, structured data stores are easier for humans to interact with, and the same goes for LLMs. I really think vector stores are equivalent to Ctrl+F and won't be around except for homogeneous data, like searching an audio data set or image store.