I think just using a SummaryIndex/ListIndex would be the best way. You will probably want to use a larger LLM though if you want the summary to be longer (e.g. gpt-3.5-turbo-16k).
from llama_index import ServiceContext, SummaryIndex
from llama_index.llms import OpenAI
llm = OpenAI(model="gpt-3.5-turbo-16k", max_tokens=750)
ctx = ServiceContext.from_defaults(llm=llm)
index = SummaryIndex.from_documents(documents, service_context=ctx)
query_engine = index.as_query_engine(response_mode="tree_summarize", use_async=True)
response = query_engine.query("Summarize this text.")
Here, the query engine builds a bottom-up tree of summaries using your query: pairs of text chunks are summarized together, then pairs of summaries are summarized together, and so on until it can return the root node.
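Roughly, the idea looks like this (just a sketch of the concept, not the actual LlamaIndex internals; summarize(query, texts) stands in for a single LLM call):
def tree_summarize_sketch(chunks, query, summarize, group_size=2):
    # keep collapsing groups of chunks into summaries until one remains
    while len(chunks) > 1:
        chunks = [
            summarize(query, chunks[i : i + group_size])
            for i in range(0, len(chunks), group_size)
        ]
    return chunks[0]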
@Logan M thanks for the response. Wouldn't this still be kind of problematic for a long interaction? As an extreme case, consider a book (I don't have any interactions this long, but I'm wondering about solving the problem this way).
@Logan M ah ok so in this tree case it is doing almost exactly what I described.
@Logan M and in this case it would call the LLM once for every chunk and chunk pair up to the root node or?
Yea exactly. But it's fairly efficient -- it tries to pack the LLM input as much as possible with every query
very cool. one last naive question on this, regarding the model choice: do the larger 16k/30k options for the GPT models refer to the response length, the context length, or both? my interactions are quite long, but they consist of spontaneous spoken conversations with a lot of 'filler', meaning that the context can be very long while most of the time the summaries can be pretty short. lots of back and forth to deal with a couple of important issues.
They refer to the context window. LLMs generate one token at a time, add it to the input, and generate the next.
Therefore the max input size might be 4096, but if you want to generate 256 tokens, you need to "leave room" in the input to generate that many tokens.
LlamaIndex handles this for you by looking at the max_tokens setting on the LLM:
llm = OpenAI(model="gpt-3.5-turbo-16k", max_tokens=750)
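So the rough arithmetic per LLM call looks like this (a sketch only, not the exact accounting LlamaIndex does; the prompt_overhead number is made up):
context_window = 16_384   # approximate window for gpt-3.5-turbo-16k
max_tokens = 750          # reserved for the generated summary
prompt_overhead = 200     # made-up allowance for the prompt template itself

input_budget = context_window - max_tokens - prompt_overhead
print(f"~{input_budget} tokens of transcript fit into each LLM call")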
or maybe it doesn't really matter due to the tree nature of the summarization
ok got it. so using the larger models makes it possible to process the calls in fewer, larger chunks, and max_tokens determines how much of that window is reserved for the output rather than the input, i guess.
@Logan M kapa.ai answered that last one. thank you so much. just wanted to also say: i'm really loving llamaindex, and this discord, including both the human responses and the kapa.ai bot, makes it truly amazing.
Glad it's been useful!
mmm is "tree_summarize" going to be called for every query? or is it applied only at indexing time?
every query -- and it has to be, since the actual summary is directed by the query
If you wanted, you could run the summary once and cache it, assuming it isn't going to change?
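Something as simple as writing it to a file would work (a minimal sketch, reusing the query_engine from above; the file name is just an example):
from pathlib import Path

cache_path = Path("call_summary.txt")  # example path

if cache_path.exists():
    summary = cache_path.read_text()
else:
    summary = str(query_engine.query("Summarize this text."))
    cache_path.write_text(summary)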
@Logan M that is correct, it should not change. this is a process that is run one time after call resolution/hangup. I'd like to produce the summary and then store the summary itself as another chunk or metadata element in the vector index.
in this case it looks like kapa.ai is suggesting this:
from llama_index.response_synthesizers import TreeSummarize

summarizer = TreeSummarize(verbose=True)
# note: "await" needs to run inside an async context (or a notebook)
response = await summarizer.aget_response("who is Paul Graham?", [text])
print(response)
That could work too yes -- a much more low-level approach
and then i can just store the final summary
in the vector index i guess
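Yep -- something along these lines should do it (a sketch against the same llama_index version as above; summary_text and the metadata values are placeholders):
from llama_index import Document, VectorStoreIndex

summary_doc = Document(
    text=summary_text,
    metadata={"type": "summary", "call_id": "example-call-123"},
)

vector_index = VectorStoreIndex.from_documents([summary_doc])
# or, if the vector index already exists:
# vector_index.insert(summary_doc)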
time for some actual exploration i guess. thanks again!