It is very easy to think that all you need to build a good RAG pipeline is to chunk your document using one of the splitters offered by LangChain, pass the chunks to an embedding model, and hook them up to a vector store that you can query at inference time to create the context for an LLM. I have spoken a bit about the importance of chunking and how hard it is to get it right for complex document types like PDFs. Let’s talk about similarity search today.
Relying only on similarity search could be a silent failure mode of your RAG pipeline. Let’s understand why:
Embedding models are encoder-style models where the last layer outputs a vector for every input token. These token-level embeddings need to be pooled to obtain a sentence-level embedding, which leads to a huge information loss. Embedding is essentially a form of lossy compression.
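To make the information loss concrete, here is a minimal sketch of mean pooling, one common pooling strategy. The 2-dimensional "token embeddings" are made-up toy values, not the output of a real model, and mean_pool is an illustrative helper name.

```python
import numpy as np

def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average the token vectors of one sequence, ignoring padded positions."""
    mask = attention_mask[:, None].astype(token_embeddings.dtype)  # shape: (tokens, 1)
    summed = (token_embeddings * mask).sum(axis=0)
    count = max(float(mask.sum()), 1.0)  # avoid dividing by zero on an all-padding input
    return summed / count

# Two clearly different token sequences (toy values, assumed for illustration).
doc_a = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_b = np.array([[0.5, 0.5], [0.5, 0.5]])
mask = np.array([1, 1])

pooled_a = mean_pool(doc_a, mask)
pooled_b = mean_pool(doc_b, mask)

# Both sequences collapse to the same pooled vector, so a similarity
# search over pooled vectors cannot tell them apart.
print(pooled_a, pooled_b)
```

Real embeddings have hundreds of dimensions, so exact collisions like this are unlikely, but the same mechanism is at work: many different token-level patterns map to nearby pooled vectors.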
These models learn to prioritise the representation of those portions of a document that were required to answer the queries in their training data. However, the queries that your system needs to handle might require the model to focus on very different parts of the document. For similarity search, we typically embed the documents and queries separately. This means that the compression of each document (as described above) happens without any consideration for the query it will later be matched against, so the query never gets a chance to inform what is kept.
These models are trained using a fixed, often outdated, vocabulary. So, they may not be able to accurately represent a word that has become common only recently (e.g. the obscure name of a newly released LLM).
What can we do?
Combine it with a keyword-search algorithm like BM25/TF-IDF. Why?
We, as humans, love using keywords. We are strongly inclined to notice and use certain acronyms and domain-specific jargon that may not be part of the training data of these models.
BM25 is still a strong baseline that many SOTA models struggle to beat, especially on data outside their training domain.
It gives a nearly free performance boost: keyword search needs no model inference, so it adds negligible compute overhead or cost at query time.
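As a rough illustration of why exact keyword matching helps with rare terms, here is a minimal sketch of BM25 scoring in plain Python. It uses one common variant of the formula (the IDF with +1 inside the log) with assumed defaults k1=1.5 and b=0.75, and whitespace tokenisation. The documents and the model name "zephyrq7" are invented for the example. In practice you would likely use an existing library or your search engine's built-in BM25 instead.

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a simple BM25 variant."""
    if not docs:
        return []
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "the new model zephyrq7 was released this week",  # hypothetical model name
    "embedding models compress documents into vectors",
    "keyword search matches exact terms",
]
scores = bm25_scores("zephyrq7 release notes", docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

Only the first document contains the rare term, so it is the only one with a non-zero score, regardless of whether any embedding model has ever seen that name.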
Use a reranker (e.g. cross-encoder or ColBERT) on the outputs of similarity search and keyword search:
Cross-encoders take both the raw query and document as input to predict a similarity score between them. This ensures that the full information of the query and the document is used, instead of just their compressed representations.
ColBERT is a family of models which uses “late interaction” to compute the similarity between a given query and document. Here, an encoder is still used independently on both the query and the document to generate token-level embeddings for each. However, instead of pooling the token-level embeddings to obtain a sentence-level embedding, each token embedding in the query is compared with every token embedding in the document to find the maximum similarity for that query token. This is repeated for every query token and finally, the token-level scores are summed to obtain a score for the query-document pair.
Because of this, the ranking produced by a reranker is more reliable.
However, it is not practical to run a reranker on every query-document pair in a large corpus because of the computational overhead.
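The ColBERT-style scoring described above (max over document tokens, sum over query tokens) can be sketched in a few lines of NumPy. The token embeddings below are toy 2-dimensional values chosen by hand; a real setup would take them from a trained encoder, and maxsim_score is an illustrative name rather than a library function.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """For each query token take its best-matching document token, then sum."""
    q = l2_normalize(query_tokens)
    d = l2_normalize(doc_tokens)
    sim = q @ d.T  # shape: (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

# Toy token embeddings (assumed values, not from a real model).
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc_match = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # covers both query tokens
doc_partial = np.array([[1.0, 0.0], [1.0, 0.1]])            # covers only the first

print(maxsim_score(query, doc_match), maxsim_score(query, doc_partial))
```

Because no pooling happens, a document that covers only one of the two query tokens scores visibly lower than one that covers both. The cost is that every document needs one stored vector per token instead of a single vector.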
Bringing it all together:
Given a query, use keyword search + similarity search on your documents to identify a smaller set of potentially relevant documents. In this step, optimise for recall: make sure all the relevant documents are present in the output, even if it also contains several irrelevant ones.
Use a reranker for every query-document pair on this smaller document set to rank the most relevant documents higher than the others and return the top few documents as the context for the LLM.
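The two steps above can be sketched as a small pipeline with pluggable scorers. Everything here is a stand-in: keyword_overlap plays the role of BM25, char_trigram_overlap plays the role of embedding similarity, and toy_reranker plays the role of a cross-encoder or ColBERT. Taking the union of both candidate lists is one simple way to combine the retrievers, assumed here for illustration; other fusion schemes exist.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def top_k(query: str, docs: list[str], scorer: Scorer, k: int) -> list[int]:
    """Indices of the k highest-scoring documents under the given scorer."""
    ranked = sorted(range(len(docs)), key=lambda i: scorer(query, docs[i]), reverse=True)
    return ranked[:k]

def retrieve_then_rerank(query: str, docs: list[str], keyword_scorer: Scorer,
                         vector_scorer: Scorer, reranker: Scorer,
                         candidates_per_retriever: int = 2, final_k: int = 2) -> list[str]:
    # Stage 1 (recall): union of the candidates from both cheap retrievers.
    candidates = set(top_k(query, docs, keyword_scorer, candidates_per_retriever))
    candidates |= set(top_k(query, docs, vector_scorer, candidates_per_retriever))
    # Stage 2 (precision): run the expensive reranker only on the candidates.
    ranked = sorted(sorted(candidates), key=lambda i: reranker(query, docs[i]), reverse=True)
    return [docs[i] for i in ranked[:final_k]]

# Toy stand-ins for the real components (assumptions for illustration).
def keyword_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def char_trigram_overlap(query: str, doc: str) -> float:
    grams = lambda t: {t[i:i + 3] for i in range(len(t) - 2)}
    a, b = grams(query.lower()), grams(doc.lower())
    return len(a & b) / len(a | b) if a | b else 0.0

def toy_reranker(query: str, doc: str) -> float:
    q_tokens, d_tokens = query.lower().split(), set(doc.lower().split())
    return sum(t in d_tokens for t in q_tokens) / len(q_tokens)

docs = [
    "bm25 is a keyword search baseline",
    "cross-encoders rerank query document pairs",
    "vector stores hold embeddings",
    "chunking pdfs is hard",
]
result = retrieve_then_rerank("rerank query document pairs", docs,
                              keyword_overlap, char_trigram_overlap, toy_reranker)
print(result)
```

The point of the structure is that the reranker is called only on the handful of candidates from stage 1, never on the whole corpus.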
A note of caution: for every component you add or remove, and for any other change you make to your retrieval pipeline, you must measure its impact on your own evaluation dataset rather than on some generic benchmark irrelevant to the task you care about. Use retrieval-specific metrics like Precision@k, NDCG@k, Reciprocal Rank or any other metric suitable for your specific task.
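For reference, here is a minimal sketch of the three metrics mentioned above for a single query. The document IDs and relevance labels are made up, and the NDCG here uses linear gain, which is one of several variants. In a real evaluation you would average these over all queries in your dataset.

```python
import math

def precision_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids: list[str], relevance: dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ranking (linear gain)."""
    dcg = sum(relevance.get(d, 0.0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical retrieval output and labels for one query.
ranked = ["d3", "d1", "d7", "d2"]
relevance = {"d1": 1.0, "d2": 1.0}
relevant = set(relevance)

p_at_2 = precision_at_k(ranked, relevant, k=2)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, relevance, k=4)
print(p_at_2, rr, ndcg)
```

Precision@k ignores ordering within the top k, Reciprocal Rank only looks at the first hit, and NDCG rewards placing every relevant document as high as possible, so which one matters depends on how many documents you pass to the LLM.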
While all these are important considerations in RAG systems, we often focus on model choices but overlook security. It's crucial to integrate access control (like RBAC or ABAC), pre-retrieval filtering, and query-time security checks to ensure only authorized users can access sensitive documents. This ensures compliance and protects against unauthorized access.