Comparison of Document Summary Index & Sentence Window Methods in RAG (Coding with LlamaIndex Walkthrough)

This is the coding walk-through of the RAG (Retrieval Augmented Generation) for Company Centric Documents: Summary Index vs Node Sentence Window Method Comparison article. It largely follows the demos mentioned in LlamaIndex’s Advanced RAG site.

There are 5 application files in the repository:

  • create_idx_summ_idx_md.py (Document Summary Index method: building retrieval knowledgebase)
  • rtrv_synth_qry_summ_idx_md.py (Document Summary Index method: retrieval and query)
  • create_idx_sent_window_md.py (Node Sentence Window method: building retrieval knowledgebase)
  • query_sent_window_md.py (Node Sentence Window method: retrieval and query)
  • get_index_info.py (class to examine vectorstore index summaries)

Because indexing takes between 10 and 20 minutes and incurs token costs, the indexes need to be persisted (saved) to disk first, and the retrieval & query steps are done separately by different apps. I didn’t create a main.py file to execute everything, as I plan to modify the app files for another, larger project later on.

The two contextual retrieval methods for comparison are Document Summary Index and Node Sentence Window, which I will discuss more in detail later.

Here is a simple analogy:

  • Document summarization is like a park ranger looking for lost hikers in a forest. The park ranger targets likely big areas first and then narrows down to smaller areas. Factors such as terrain, weather, and past experience matter more. He can also assume that survivors are close to each other. Document summarization works similarly: it starts with the larger context first before working its way down to the semantic details.
  • The node sentence window method, also sometimes referred to as “big-small” retrieval, is more like a coast guard searching for the survivors of a shipwreck. He needs to ping specific locations in the vast ocean and drop in to conduct wider area searches. Survivors are likely to have drifted away from each other. Likewise, the sentence window method searches for specific sentences and brings back the “blurb” around it.

Essentially, both methods are supposed to make retrieval smarter by providing “context,” but for document summary, context is like the park ranger’s map, and for sentence window, it is more like the beacons to the coast guard. Now let’s go through the code.


Code Example: Document Summary Index Method

The code here is based on LlamaIndex’s Document Summary Index tutorial.

This is split into two files so that we only index once:

  • The first file builds and saves the index.
  • The second file performs retrieval & query.


Build Knowledgebase (Reading Documents & Indexing)

Dependencies & Settings

First, I imported the libraries from Llama-Index and OpenAI.
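As a rough sketch, the imports might look like this (exact import paths depend on your llama-index version; these follow the older top-level llama_index layout used at the time of the tutorial):

import os
import openai

from llama_index import (
    ServiceContext,
    SimpleDirectoryReader,
    get_response_synthesizer,
)
from llama_index.indices.document_summary import DocumentSummaryIndex
from llama_index.llms import OpenAI
from llama_index.node_parser import SentenceSplitter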


Read Documents

[insert: create_idx_summ_idx_md_scr_shot_2_read_docs]

I used LlamaIndex’s generic reader, SimpleDirectoryReader, to read all the PDF files in the data folder (there is a dedicated PDF reader based on PyMuPDF, but I used the generic reader).

If you just want to read certain files, you can use the input_files parameter instead of input_dir.
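A minimal sketch of the reader call (the folder and file names are placeholders), continuing from the imports above:

documents = SimpleDirectoryReader(
    input_dir="data",          # read everything in the data folder
    required_exts=[".pdf"],    # limit to PDF files
    filename_as_id=True,       # use the file path as the document id
).load_data()

# Or read only specific files instead of a whole directory:
# documents = SimpleDirectoryReader(input_files=["data/some_report.pdf"]).load_data()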

Index

[insert: create_idx_summ_idx_md_scr_shot_3_build_index]


I instantiated the OpenAI API with my OpenAI API key (I saved it directly in my VS Code launch.json file, but I recommend you save it as an environment variable).

The ServiceContext function is a setting function:

  • I set the LLM to OpenAI’s gpt-3.5-turbo model for embedding and query (LlamaIndex actually defaults to OpenAI’s models, but I wanted to specify it anyway).
  • I also set “chunk_size_limit” to 1024 - partly to indirectly slow down the API calls (the OpenAI API has a token-per-minute limit; if you go over it, the connection will break), and partly because if a chunk is too large, it loses semantic specificity, making it hard to retrieve.

Note: you do not necessarily have to use the same LLM for embedding and query. Embedding requires you to send your entire dataset to the LLM - a different level of data exposure than just the query process. If you are using a cloud LLM and working with big document files, dedicated embedding models make more sense because their token costs are much cheaper. Also, if you have to keep data in-house, you will have to use an on-premises LLM instead.

“splitter = SentenceSplitter …” specifies which “splitter” to use to cut texts into “chunks” - small snippets of text that are pulled out and combined with your query to form the “augmented” prompt sent to an LLM (the core concept of RAG). Chunk size is how big each snippet should be - 1,024 is a common default. Vanilla RAG models usually chunk by exact length, which cuts off sentences and words. Here, because we are using a sentence splitter (SentenceSplitter), if the chunk limit is reached but the sentence is not done yet, it will finish the sentence and then cut the chunk (i.e., it will finish a sentence like “related to the UK only.” before starting the next chunk).
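Putting these settings together, a sketch might look like the following. This assumes the legacy ServiceContext API (newer releases replace it with Settings), and it carries the 1,024 limit through the splitter’s chunk_size; reading the key from the environment is my substitution for the launch.json approach:

openai.api_key = os.environ["OPENAI_API_KEY"]

llm = OpenAI(model="gpt-3.5-turbo", temperature=0)

# Split on sentence boundaries so chunks end on complete sentences.
splitter = SentenceSplitter(chunk_size=1024)

service_context = ServiceContext.from_defaults(llm=llm, text_splitter=splitter)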

Each chunk is also commonly called a “node” by LlamaIndex and some others, to reflect the relationships between the chunks and higher levels like sections, paragraphs, chapters, documents, and so on.

It’s also important to note that the documents themselves are broken down into sub-documents (each PDF file is segmented into 20 to 100+ sub-documents depending on its size). Because I set filename_as_id=True when reading the documents, the document ids are the full file paths of the original PDF documents with “_part_0”, “_part_1”, etc. appended after “.pdf”.

The next line of code is the crux of the customization:

response_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=False
)

The get_response_synthesizer function is part of the llama_index.response_synthesizers module. It sends text chunks into a Large Language Model (LLM) to get responses - summaries of the texts - back. But why is it here in the indexing stage?

The synthesizer here is what enhances RAG with an LLM - we are using ChatGPT as a text summarizer. It shrinks the text size and builds a simple hierarchical structure in the retrieval knowledgebase. The response_mode determines how the user query and the retrieved text chunks are combined to form the prompt that is sent to the LLM. Here it is set to “tree_summarize”, meaning hierarchical summarization.

When the synthesizer is called by DocumentSummaryIndex() in the next line, the LLM summarizes the text at the sub-document level.
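A sketch of that build call, assuming the legacy ServiceContext-based signature (argument names vary slightly across LlamaIndex versions):

doc_summary_index = DocumentSummaryIndex.from_documents(
    documents,
    service_context=service_context,            # LLM + sentence splitter settings
    response_synthesizer=response_synthesizer,  # summarizes each sub-document at index time
    show_progress=True,
)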

Note: other response_mode options include refine, compact, simple_summarize, no_text, accumulate, and compact_accumulate (similar to accumulate, but it “compacts” each LLM prompt, as compact does, while running the same query against each text chunk).

Save/Persist to Disk

After indexing, which takes some time, I saved the index to disk using the persist method.
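For example (the folder name is just a placeholder):

# Save the index (nodes, summaries, embeddings) to disk for the query app.
doc_summary_index.storage_context.persist(persist_dir="index_summary")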

A Note on LlamaIndex’s Data Structure

If you are new to LlamaIndex like me, you may find LlamaIndex’s data structure here a little confusing. The chatbot actually won’t help much. I had to dig into the source code.

So let me explain it further:

  • Typically, a tree hierarchy needs a node class (object) and a tree class to manage the relationships between nodes to give it structure.
  • LlamaIndex uses BaseNode as the basic node class (BaseNode itself inherits from the BaseComponent class).
  • The BaseNode has a subclass called TextNode, which is de facto the “node” in our context, because it stores the text “chunk”.
  • The TextNode then has different subclasses of its own. One of them is the Document class (the different node classes can be found in LlamaIndex’s schema.py file on GitHub).
  • Here the Document object is actually not an entire document. It is really a sub-document, because LlamaIndex uses the Document object to store the sub-document texts.
  • During chunking, the NodeParser creates new node objects, so that there are two “decks” of data: one for sub-documents (long texts) and one for nodes (short texts). They are linked to each other by fields like “text_id_to_ref_doc_id” (text_id refers to the node_id, and ref_doc_id refers to the document id).
  • DocumentSummaryIndex is the class responsible for generating, storing, and retrieving summaries (it inherits the tree structure from another class called IndexDocumentSummary, which inherits from other classes, so on.)
  • DocumentSummaryIndex class goes to the document (sub-document) “deck”, summarizes their texts with the LLM and replaces the full texts with summaries. It then turns the document objects into node objects and shuffles them into the node “deck.” This is because ultimately the TextNode objects can be sent to the LLM for embedding.

After indexing, there are two “layers” of nodes: leaf nodes with the original text chunks (bottom layer), and root nodes with the summary texts (top layer).

Personally, I find the creation of the “document” class to be redundant and confusing:

  • First, calling it Document is a misnomer and confuses users.
  • Secondly, it’s a haphazard, band-aid fix for creating hierarchies, or relationships between nodes in general. It doesn’t scale: as documents grow in number and size, you want to be able to create as many tiers as needed.

LlamaIndex’s NodeRelationship class’s constructor attributes include “previous”, “next”, “parent”, and “child”. So the options are there in the BaseNode class, and these simple “front-back-up-and-down” attributes are enough to cover any hierarchy you can think of. But in DocumentSummaryIndex, the parent and child attributes were not used at all. It may be because the document summary index tool was built a while back. Next time I will just use LlamaIndex's tree classes directly (and build my own structure if needed).


Query (Retriever, Synthesizer, & Query Engine)

Dependencies & Setting

I imported libraries from LlamaIndex and OpenAI. This time, I used Settings to specify which LLM to use.

Load Index

Load the index files from disk into an index object (if you used a vector database, like Pinecone or Chromadb, load accordingly.)
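A sketch of the query-side setup, assuming the newer llama_index.core layout with Settings (the persist_dir is a placeholder matching whatever you used when saving):

from llama_index.core import Settings, StorageContext, load_index_from_storage
from llama_index.llms.openai import OpenAI

# Specify the LLM globally this time, via Settings.
Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0)

# Rebuild the index object from the files persisted earlier.
storage_context = StorageContext.from_defaults(persist_dir="index_summary")
doc_summary_index = load_index_from_storage(storage_context)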

Set up Questions

I set up 9 questions to test the model. The first 3 questions are direct and specific. The rest are more nuanced.

Retrieval, Synthesize, and Query

Simple Method (with Default Settings)

The high-level querying method uses default parameters to retrieve, synthesize, and query, so that you don’t have to worry about configurations, and you can call it directly from the vectorstore index class. But it’s very limiting, as you will see later in the results section. For example, it only retrieves the top match (it searches the sub-document summaries but retrieves the nodes underneath the best-matching sub-document), while most RAG models retrieve at least the top 3.
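The default, high-level version is roughly the following (the sample question is just an illustration of the kind of question in my test set):

# One-liner query engine with default retrieval and synthesis settings.
query_engine = doc_summary_index.as_query_engine()
response = query_engine.query("What services does Deloitte provide to its clients?")
print(response)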

Embedding Retrieval Method (Custom Settings)

For more custom retrievals, LlamaIndex has built two tools specific to the tree summarization approach: the LLMRetriever and the EmbeddingRetriever (both live in the same file and share the same configuration). The former uses the LLM to perform the retrieval. The latter plugs the embedded query vector directly into the index for retrieval. I picked the latter because it should be faster and save more API tokens.

How it works:

  • the retriever embeds the query, searches for and retrieves the relevant nodes, and returns the retrieved nodes (objects containing the node data and scores)
  • the synthesizer takes in those nodes, combines the retrieved text with the query into a new prompt, calls the OpenAI API, and gets a response back
  • the query_engine ties the two together, working more like a pipeline

I instantiated the retriever object and set similarity_top_k to 3 instead of just 1 (top 3 results, not just 1).

I used the retriever’s retrieve method to take the query, retrieve the top 3 results, and iterate over them to display the retrieved nodes. This shows us what’s under the hood.

I then instantiated get_response_synthesizer, with response_mode set to “tree_summarize.” (The method allows you to customize a few other things, such as the prompt template that combines the retrieved nodes and the query, but I left those at their defaults.)

Finally, I used the RetrieverQueryEngine to activate the whole thing and performed the query.
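Putting the custom pieces together, a sketch of the query app might look like this. The retriever class name follows the current LlamaIndex API for the document summary embedding retriever, the import paths assume the llama_index.core layout, and the question is illustrative:

from llama_index.core import get_response_synthesizer
from llama_index.core.indices.document_summary import (
    DocumentSummaryIndexEmbeddingRetriever,
)
from llama_index.core.query_engine import RetrieverQueryEngine

# Embedding-based retriever over the summary layer, returning the top 3 matches.
retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    similarity_top_k=3,
)

# Peek under the hood: show what was retrieved before any synthesis happens.
retrieved_nodes = retriever.retrieve("What is Deloitte's revenue per employee in Norway?")
for node_with_score in retrieved_nodes:
    print(node_with_score.score, node_with_score.node.get_content()[:200])

# Combine the retrieved text with the query and send it to the LLM.
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)
response = query_engine.query("What is Deloitte's revenue per employee in Norway?")
print(response)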


Coding Example: Sentence Window Method

The code largely follows the “Metadata Replacement + Node Sentence Window” tutorial.

Build Knowledgebase

Dependencies & Settings

Import similar libraries as in the document summary index method.

[insert create_idx_sent_window_md_scr_shot_2_settings]

Instantiate the OpenAI API with your API key. Here, I used Settings to specify which LLM to use. The window method is very token-intensive, so LlamaIndex recommends using a local embedding engine instead - HuggingFace’s embedding engine. I opted for OpenAI’s “text-embedding-ada-002” instead because my dataset is not that big.
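For example (assuming the newer Settings API; the embedding model name is the one mentioned above):

from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")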

Also, a different node parser was imported.

Read Documents

Load the documents using SimpleDirectoryReader, exactly like in the document summary index method.

Extract Nodes

[insert create_idx_sent_window_md_scr_shot_4_extract_nodes]

Indexing for sentence window is fairly straightforward. The only nuance is that “chunking” and indexing/embedding form a 3-step process:

  • Initialize the SentenceWindowNodeParser class using its factory method from_defaults(), which sets up an instance with specific default configurations;
  • Use this instance to call the main method get_nodes_from_documents to load in the document objects and parse them into nodes;
  • And index.

The SentenceWindowNodeParser is a subclass of NodeParser. It’s similar to a vanilla sentence parser, except that it stores extra sentences in each node’s metadata dictionary. Here we set window_size to 3 and the metadata keys to “window” and “original_text”. This means each complete sentence is a node, but its metadata stores 7 sentences (3 before + the sentence itself + 3 after) under the key “window”, and the single sentence itself under “original_text.”
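A sketch of the parser setup described above (documents comes from the SimpleDirectoryReader step):

from llama_index.core.node_parser import SentenceWindowNodeParser

# Each sentence becomes a node; its 3 neighbours on either side are stored in
# metadata["window"], and the sentence itself in metadata["original_text"].
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
nodes = node_parser.get_nodes_from_documents(documents)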

Note that the documents are also automatically broken into 409 small sub-documents, but since there is no summarization or hierarchy here, the sub-documents (what LlamaIndex calls “documents” or “docs”) do not really matter.

Index and Save

Use the standard VectorStoreIndex to index/embed, then persist the index to disk.
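Roughly (the persist folder name is a placeholder; nodes comes from the parser step above):

from llama_index.core import VectorStoreIndex

# Embed the sentence nodes and persist the index for the separate query app.
sentence_index = VectorStoreIndex(nodes)
sentence_index.storage_context.persist(persist_dir="index_sentence_window")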

Query

Dependencies & Settings

Dependencies and settings are mostly the same as in the build-index app.

Load Index

Load from index files to create an index instance.

Retrieve & Query

Set up the query questions exactly as in the other code example.

The MetadataReplacementPostProcessor class in LlamaIndex is used to replace the sentence in each node with its surrounding context.
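A sketch of the query side with the window replacement wired in (the persist_dir, top-k value, and question are illustrative):

from llama_index.core import StorageContext, load_index_from_storage
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Load the persisted sentence-window index.
storage_context = StorageContext.from_defaults(persist_dir="index_sentence_window")
sentence_index = load_index_from_storage(storage_context)

# Replace each retrieved sentence with its 7-sentence "window" before synthesis.
query_engine = sentence_index.as_query_engine(
    similarity_top_k=3,
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
response = query_engine.query("What is Deloitte's revenue per employee in Norway?")
print(response)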


Output

Here are selected response samples:

Document Summary Index

  • “Deloitte's revenue per employee in Norway can be calculated by dividing the total revenue by the number of employees.”
  • “To calculate Deloitte's revenue per employee in Malaysia, we would need specific information regarding the number of employees in Malaysia and the total revenue generated by Deloitte in Malaysia. Without this specific data, it is not possible to determine Deloitte's revenue per employee in Malaysia.”
  • “Deloitte provides industry-leading audit and assurance, tax and legal, consulting, financial advisory, and risk advisory services to clients.”

Sentence Window Method

  • “Deloitte provides industry-leading audit and assurance, tax and legal, consulting, financial advisory, and risk advisory services to clients.”
  • “Deloitte's revenue per employee in Norway is not explicitly provided in the context information.”

See the GitHub repository for the detailed output Excel file.

See the main article, Comparative Analysis of Summary Index and Node Sentence Window Methods in RAG for Local Subsidiary Documents, for more in-depth analysis.

