I think just using a SummaryIndex/ListIndex would be the best way. You will probably want to use a larger LLM though if you want the summary to be longer (e.g. gpt-3.5-turbo-16k).
from llama_index import ServiceContext, SummaryIndex
from llama_index.llms import OpenAI
llm = OpenAI(model="gpt-3.5-turbo-16k", max_tokens=750)
ctx = ServiceContext.from_defaults(llm=llm)
index = SummaryIndex.from_documents(documents, service_context=ctx)
query_engine = index.as_query_engine(response_mode="tree_summarize", use_async=True)
response = query_engine.query("Summarize this text.")
Here, the query engine builds a bottom-up tree of summaries using your query: pairs of text chunks are summarized together, then pairs of summaries are summarized together, and so on until it can return the root node.
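Roughly, the idea looks like this (just a sketch of the concept, not the actual LlamaIndex internals; summarize(query, texts) stands in for a single LLM call):
def tree_summarize_sketch(chunks, query, summarize, group_size=2):
    # keep collapsing groups of chunks into summaries until one remains
    while len(chunks) > 1:
        chunks = [
            summarize(query, chunks[i : i + group_size])
            for i in range(0, len(chunks), group_size)
        ]
    return chunks[0]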
@Logan M thanks for the response. Wouldn't this still be kind of problematic for a long interaction? As an extreme case, consider a book (I don't have any interactions this long, but I'm wondering about solving the problem this way).
@Logan M ah ok so in this tree case it is doing almost exactly what I described.
@Logan M and in this case it would call the LLM once for every chunk and chunk pair up to the root node or?
Yea exactly. But it's fairly efficient -- it tries to pack the LLM input as much as possible with every query
very cool. one last naive question on this, regarding the model choice: do the larger 16k/30k options for the GPT models refer to the response length, the context length, or both? my interactions are quite long, but they consist of spontaneous spoken conversations with a lot of 'filler', meaning that the context can be very long while most of the time the summaries can be pretty short. lots of back and forth to deal with a couple of important issues.
They refer to the context window. LLMs generate one token at a time, add it to the input, and generate the next.
Therefore the max input size might be 4096, but if you want to generate 256 tokens, you need to "leave room" in the input to generate that many tokens.
LlamaIndex handles this for you by looking at the max_tokens setting on the LLM:
llm = OpenAI(model="gpt-3.5-turbo-16k", max_tokens=750)
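So the rough arithmetic per LLM call looks like this (a sketch only, not the exact accounting LlamaIndex does; the prompt_overhead number is made up):
context_window = 16_384   # approximate window for gpt-3.5-turbo-16k
max_tokens = 750          # reserved for the generated summary
prompt_overhead = 200     # made-up allowance for the prompt template itself

input_budget = context_window - max_tokens - prompt_overhead
print(f"~{input_budget} tokens of transcript fit into each LLM call")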
or maybe it doesn't really matter due to the tree nature of the summarization
ok got it. so using the larger models makes it possible to process the calls in fewer, larger chunks, and max_tokens determines how much of that window is reserved for the output rather than the input, i guess.
@Logan M kapa.ai answered that last one. thank you so much. just wanted to also say: i'm really loving llamaindex, and this discord, including both the human responses and the kapa.ai bot, makes it truly amazing.
Glad it's been useful!
mmm is "tree_summarize" going to be called for every query? or is it applied only at indexing time?
every query -- and it has to be, since the actual summary is directed by the query
If you wanted, you could run the summary once and cache it, assuming it isn't going to change?
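Something as simple as writing it to a file would work (a minimal sketch, reusing the query_engine from above; the file name is just an example):
from pathlib import Path

cache_path = Path("call_summary.txt")  # example path

if cache_path.exists():
    summary = cache_path.read_text()
else:
    summary = str(query_engine.query("Summarize this text."))
    cache_path.write_text(summary)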
@Logan M that is correct, it should not change. this is a process that is run one time after call resolution/hangup. I'd like to produce the summary and then store the summary itself as another chunk or metadata element in the vector index.
in this case it looks like kapa.ai is suggesting this:
from llama_index.response_synthesizers import TreeSummarize

summarizer = TreeSummarize(verbose=True)
# note: "await" needs to run inside an async context (or a notebook)
response = await summarizer.aget_response("who is Paul Graham?", [text])
print(response)
That could work too yes -- a much more low-level approach
and then i can just store the final summary
in the vector index i guess
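Yep -- something along these lines should do it (a sketch against the same llama_index version as above; summary_text and the metadata values are placeholders):
from llama_index import Document, VectorStoreIndex

summary_doc = Document(
    text=summary_text,
    metadata={"type": "summary", "call_id": "example-call-123"},
)

vector_index = VectorStoreIndex.from_documents([summary_doc])
# or, if the vector index already exists:
# vector_index.insert(summary_doc)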
time for some actual exploration i guess. thanks again!