
Updated 2 months ago

Confirming the use of SummaryIndex for summarizing PDF documents

I have around 1000 PDF documents (slides, scientific publications, etc.) and I want to create a summary of each one. As I understand it, I need to use SummaryIndex (https://docs.llamaindex.ai/en/stable/api_reference/indices/summary/) and not DocumentSummaryIndex (https://docs.llamaindex.ai/en/stable/examples/index_structs/doc_summary/DocSummary/). Can someone confirm? Also, any tips before I set up the pipeline?
If you need a summary of each document, then you need to use DocumentSummaryIndex.

SummaryIndex basically pulls all the nodes and iterates over them one by one to form the final answer, which increases the time taken to respond.

DocumentSummaryIndex, by contrast, is a one-time effort that happens while building your index: each Document object goes through the LLM, which generates a summary for it.
1) It's only a one-time activity. I want to generate a document containing the summaries of all the papers (first column: document name, second column: document summary), so I am not worried about response time. Ultimately I want to append this summary document to the persisted DuckDB vector store.
2) Which is better: creating the summary document and adding it to the persisted DuckDB vector store, or creating the index and attaching this new index to the persisted DuckDB vector store file?
3) How do I print the summary generated by DocumentSummaryIndex for a given document? Should I loop through the documents one by one, or load them all and then write each summary to a payload file while going through each doc_ID?
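For the two-column payload file in question 3, here is a minimal sketch, assuming you already have some way to look up a summary by document ID. The names `write_summary_table`, `get_summary` and `fake_summaries` are made up for illustration and are not LlamaIndex APIs. With a real index, `get_summary` could probably be the `get_document_summary` method used in the DocumentSummaryIndex example linked above, but check that against your installed version.

```python
import csv
import os
import tempfile
from typing import Callable, Iterable


def write_summary_table(
    doc_ids: Iterable[str],
    get_summary: Callable[[str], str],
    out_path: str,
) -> int:
    """Write one (document, summary) row per doc ID and return the row count."""
    rows = 0
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["document", "summary"])  # header: column 1 name, column 2 summary
        for doc_id in doc_ids:
            writer.writerow([doc_id, get_summary(doc_id)])
            rows += 1
    return rows


# Stand-in summaries so the sketch runs without an LLM or an index (hypothetical data).
fake_summaries = {
    "paper_001.pdf": "Summary of paper 1.",
    "slides_002.pdf": "Summary of slides 2.",
}
out_path = os.path.join(tempfile.gettempdir(), "summaries.csv")
n = write_summary_table(fake_summaries, fake_summaries.__getitem__, out_path)
```

Looping over the document IDs once and writing each row as you go keeps memory flat, which may matter with 1000+ documents.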
Can you explain more about SummaryIndex? Its documentation is not very detailed (https://docs.llamaindex.ai/en/stable/api_reference/indices/summary/). For example, when you say it 'pulls all the nodes', how does it do that: based on a summary, or on some filter or prompt tokens?
As I understand it, there is no implicit LLM-powered summary generation step involved (like the backend summary generation step in DocumentSummaryIndex), despite the word Summary in the name SummaryIndex. Is that correct? If so, just curious: why is it named SummaryIndex?
Do both methods ensure that all the nodes belonging to a document are considered while summarizing? Or can it be hit or miss with SummaryIndex, since its summarization is based solely on prompt instructions and there is no parametric control to force the system to consider all the nodes belonging to a document when generating the summary?
So in SummaryIndex, once you ask a query, it iterates through each node that you have and passes it to the LLM along with the query and the previous context to form an updated answer. This goes on until all the nodes are consumed.
This is a costly approach, as you iterate over all the nodes for every query.


Whereas in DocumentSummaryIndex, before the documents are chunked into nodes, a summary is created over the whole Document and attached to that document's chunked nodes.

This is a one-time cost.
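To put rough numbers on the cost difference described above, here is a back-of-the-envelope sketch. It assumes one LLM call per node per query for SummaryIndex and one summary call per document at build time for DocumentSummaryIndex. Both are simplifications rather than measured library behavior, and apart from the 1000 documents mentioned in the thread, the node and query counts are invented for illustration.

```python
def summary_index_calls(num_nodes: int, num_queries: int) -> int:
    """Assumed cost: every query touches every node once."""
    return num_nodes * num_queries


def document_summary_build_calls(num_docs: int, calls_per_doc: int = 1) -> int:
    """Assumed cost: a fixed number of summary calls per document, paid once."""
    return num_docs * calls_per_doc


# 1000 documents (from the thread); 5000 nodes and 20 queries are hypothetical.
nodes, docs, queries = 5000, 1000, 20
per_query_total = summary_index_calls(nodes, queries)  # 5000 * 20 = 100000
build_total = document_summary_build_calls(docs)       # 1000 * 1 = 1000
```

If you only ever run one summarization pass per document, the gap narrows considerably, so the comparison matters most when the same nodes are queried repeatedly.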
Correct me if I'm wrong: if I load docA and it gets split into nodeA, nodeB and nodeC (a sequence that respects the order of the source document's content), then to answer a query it will first pass nodeA + query + previous context (if any) to the LLM and get answerA; then it will pass nodeB + query + previous context (a function of answerA) to the LLM and get answerB; finally it will pass nodeC + query + previous context (a function of answerA + answerB) to get answerC, and answerC will be displayed to the user as the answer to the query. Am I correct?
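The nodeA, nodeB, nodeC walk-through above can be sketched as a plain loop. This is only an illustration of the sequence as described in this thread, not LlamaIndex's internals: `refine_over_nodes` and `fake_llm` are hypothetical names, and `fake_llm` just concatenates text so the flow is visible without a real model. How the library actually batches nodes into LLM calls depends on the configured response mode.

```python
from typing import Callable, List, Optional


def refine_over_nodes(
    nodes: List[str],
    query: str,
    llm: Callable[[str, str, Optional[str]], str],
) -> str:
    """Pass each node, the query, and the previous answer to the LLM in order."""
    answer: Optional[str] = None
    for node_text in nodes:
        # The previous answer is the "previous context" carried into the next call.
        answer = llm(node_text, query, answer)
    return answer or ""


calls: List[str] = []


def fake_llm(node_text: str, query: str, previous: Optional[str]) -> str:
    """Stand-in for a real LLM call: records the node and extends the answer."""
    calls.append(node_text)
    return node_text if previous is None else f"{previous}+{node_text}"


final = refine_over_nodes(["nodeA", "nodeB", "nodeC"], "What is docA about?", fake_llm)
print(final)  # nodeA+nodeB+nodeC
```

In this sketch the value returned after the last node plays the role of answerC, and every node is visited exactly once per query.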
So it's called SummaryIndex because it touches all the nodes, and not because it summarizes all the nodes on the backend like DocumentSummaryIndex does, right?
I am getting the following error for 1600 Document objects.
My code snippet:
[Attachment: image.png]