hey! just to clarify, are you just looking to summarize a document or are you specifically looking to 1) ask a set of questions, and 2) summarize over these questions?
If it's just for summarization, you should try putting the docs in a list index and calling index.query.
For the latter, you can create a new index from the response objects.
e.g.
index = GPTListIndex([r.response for r in responses])
index.query("What is a summary of this document?")
I was indeed looking for the latter option. Also, when I ask a question I sometimes get "it is not possible to answer this question with the given context". For context, I was using a list index with the LLM and response_mode="tree_summarize", but I believe there was enough context in the document to answer that question. Do you know how I could solve this issue?
Is this the right way to build the responses list?
responses = []
response1 = index.query(
    "What were the year over year revenue trends in the period, and what drove that change",
    response_mode='tree_summarize')
responses.append(response1.response)
I get the following traceback when I try the above code:
Traceback (most recent call last):
File "C:\Users\surya\OneDrive\Documents\upwork\doc-summarizer\test.py", line 55, in <module>
index2 = GPTListIndex(responses, llm_predictor=llm_predictor,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\surya\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\indices\list\base.py", line 54, in __init__
super().__init__(
File "C:\Users\surya\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\indices\base.py", line 103, in __init__
documents = self._process_documents(
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\surya\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\indices\base.py", line 179, in _process_documents
raise ValueError(f"Invalid document type: {type(doc)}.")
ValueError: Invalid document type: <class 'str'>.
@mrmvp how are you constructing the index?
like this
index2 = GPTListIndex(responses, llm_predictor=llm_predictor,
                      prompt_helper=prompt_helper)
oh yeah, assuming each response is the output of index.query, the response is actually an object, not a plain string. So you may want to do
response_strs = [str(r) for r in responses]
index2 = GPTListIndex(response_strs, ...)
I am still not able to get it to work; here is the updated snippet:
documents = SimpleDirectoryReader('data').load_data()
index = GPTListIndex(documents, llm_predictor=llm_predictor,
                     prompt_helper=prompt_helper)

responses = []
response1 = index.query(
    "What were the year over year revenue trends in the period, and what drove that change",
    response_mode='tree_summarize')
print(response1)
responses.append(response1.response)

response2 = index.query(
    "What were the product or regional differences that drove revenue changes",
    response_mode='tree_summarize')
print(response2)
responses.append(response2.response)

response_strs = [str(r) for r in responses]
index2 = GPTListIndex(response_strs, llm_predictor=llm_predictor,
                      prompt_helper=prompt_helper)
index2.query("Summarize in less than 150 words.")
Do response_strs = [Document(str(r)) for r in responses]
instead (Document can be imported from gpt_index)
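Putting it together, a minimal end-to-end sketch of the whole flow (this assumes Document, GPTListIndex, and SimpleDirectoryReader are all importable from the top-level gpt_index package, and that llm_predictor and prompt_helper are defined as in your snippet):

from gpt_index import Document, GPTListIndex, SimpleDirectoryReader

# Build a list index over the source documents.
documents = SimpleDirectoryReader('data').load_data()
index = GPTListIndex(documents, llm_predictor=llm_predictor,
                     prompt_helper=prompt_helper)

questions = [
    "What were the year over year revenue trends in the period, and what drove that change",
    "What were the product or regional differences that drove revenue changes",
]

# Ask each question, then wrap each answer in a Document so that
# GPTListIndex accepts it (plain strings raise the ValueError above).
responses = [index.query(q, response_mode='tree_summarize') for q in questions]
answer_docs = [Document(str(r)) for r in responses]

# Build a second list index over the answers for the final summary.
index2 = GPTListIndex(answer_docs, llm_predictor=llm_predictor,
                      prompt_helper=prompt_helper)
print(index2.query("Summarize in less than 150 words."))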
@jerryjliu0 if you don't mind, can you please help me out with the following? I am trying to input an SEC filing, specifically a 10-Q document like this one:
https://www.sec.gov/Archives/edgar/data/1045810/000104581022000147/nvda-20220731.htm
I am currently using a GPT list index with an embedding query, "tree_summarize" as the response mode, and similarity_top_k=10. I am asking 4 to 5 questions about the document and storing the responses in an array. I then pass this array to a new list index to summarize all the answers. The final summary is sometimes good, but it is not consistent across other companies' filings; it misses some key points from the answers to the queries. So I want to know: is a list index a good approach here, or is there a better method? Thanks
@jerryjliu0 gentle reminder on this
@mrmvp oops missed this. are you using the list index? if so similarity_top_k shouldn't be a valid parameter (that's only for vector store indices). list index is in general good for summarization queries (if you want to go through the entire set of documents).
What is the query that you're passing in? Some of the output quality might depend on the query prompt you're passing in
No problem. Here are the questions I am trying to get answers for. I am also focusing on just one section of the 10-Q document, the "Management's Discussion and Analysis":
What were the revenue trends year over year in the period, and what drove that change?
What were the cost challenges or benefits, both in terms of gross margins as well as operating costs?
What were the strategic decisions in the period, including any of M&A, investment, product and growth initiatives?
What were the trends in cashflow and working capital?
Overall, are management more or less confident about the prospects of the company?
If I am using mode="embedding" with a list index, can I not use similarity_top_k?
@mrmvp oops my bad, yes you can use similarity_top_k for mode="embedding"
You could try increasing the similarity_top_k for more detailed responses. Doing summarization queries purely with embedding-based retrieval can give mixed results. Of course, having the list index process more chunks can also be expensive
Let's take the first question as an example. I am using a value of 10 for similarity_top_k, and it still misses the part of the text where the year over year trend was discussed. Does this happen because of using embedding mode? Is there a better way to do this?
if you're not super concerned about cost, you could try mode="default" with response_mode="tree_summarize", it'll go through every node in the document
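e.g. something like this (a sketch reusing your first question):

response = index.query(
    "What were the year over year revenue trends in the period, and what drove that change",
    mode="default",
    response_mode="tree_summarize")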
Yeah let me try that and see how it goes. Thank you
you can also set required_keywords=["keyword1", "keyword2"]
during the index.query call to only keep chunks that contain those keywords
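e.g. (a sketch; the keyword is just a placeholder to illustrate the parameter):

response = index.query(
    "What were the year over year revenue trends in the period, and what drove that change",
    required_keywords=["revenue"],
    response_mode="tree_summarize")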
Hey @jerryjliu0, I have modified my prompt and it started to give better results, but for some documents I still get "it is not possible to answer this question with the given context". Take a look at this log:
Searching in chunk: is to first fund operations and investments in ...
Building index from nodes: 4 chunks
0/45, summary:
The revenue decreased in the third quarter of ...
10/45, summary:
The revenue decreased by 9.3% in the third qua...
20/45, summary:
The increase in revenue in the first nine mont...
30/45, summary:
In the third quarter and first nine months of ...
40/45, summary:
The year over year revenue trends in percentag...
Initial response:
The year over year revenue trends in percentage and what drove the revenue change cannot be determined from the given context information.
[query] Total LLM token usage: 13419 tokens
[query] Total embedding token usage: 11840 tokens
The year over year revenue trends in percentage and what drove the revenue change cannot be determined from the given context information
It is able to figure out the nodes which contain the revenue trends, but it still returns "it is not possible to answer this question". Do you have any thoughts on this?
@mrmvp are you still using the vector index?
I am using the list index with mode="embedding" and response_mode="tree_summarize"
can you try mode="default"?
or increasing similarity_top_k for mode="embedding"
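e.g. either of these (sketches; query_str and the top-k value are placeholders):

# go through every node instead of retrieving by embedding
response = index.query(query_str, mode="default",
                       response_mode="tree_summarize")

# or keep embeddings but retrieve more chunks
response = index.query(query_str, mode="embedding", similarity_top_k=20,
                       response_mode="tree_summarize")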
@mrmvp were you able to solve your issue?