hey! just to clarify, are you just looking to summarize a document or are you specifically looking to 1) ask a set of questions, and 2) summarize over these questions?
If it's just for summarization, you should try putting the docs in a list index and calling index.query.
For the latter, you can create a new index from the response objects.
e.g.
index = GPTListIndex([r.response for r in responses])
index.query("What is a summary of this document?")
I was indeed looking for the latter option. Also, when I ask a question I sometimes get "it is not possible to answer this question with the given context". For context, I was using a list index with the LLM and response_mode="tree_summarize", but I believe there was enough context in the document to answer that question. Do you know how I could solve this issue?
Is this the right way to build the responses list?
responses = []
response1 = index.query(
    "What were the year over year revenue trends in the period, and what drove that change",
    response_mode='tree_summarize')
responses.append(response1.response)
I get the following traceback when I try the above code:
Traceback (most recent call last):
File "C:\Users\surya\OneDrive\Documents\upwork\doc-summarizer\test.py", line 55, in <module>
index2 = GPTListIndex(responses, llm_predictor=llm_predictor,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\surya\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\indices\list\base.py", line 54, in __init__
super().__init__(
File "C:\Users\surya\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\indices\base.py", line 103, in __init__
documents = self._process_documents(
^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\surya\AppData\Local\Programs\Python\Python311\Lib\site-packages\gpt_index\indices\base.py", line 179, in _process_documents
raise ValueError(f"Invalid document type: {type(doc)}.")
ValueError: Invalid document type: <class 'str'>.
@mrmvp how are you constructing the index?
like this
index2 = GPTListIndex(responses, llm_predictor=llm_predictor,
                      prompt_helper=prompt_helper)
oh yeah, assuming each response is the output of index.query, the response is actually an object, not a plain string. So you may want to do
response_strs = [str(r) for r in responses]
index2 = GPTListIndex(response_strs, ...)
I am still not able to get it to work; here is the updated snippet:
documents = SimpleDirectoryReader('data').load_data()
index = GPTListIndex(documents, llm_predictor=llm_predictor,
                     prompt_helper=prompt_helper)

responses = []
response1 = index.query(
    "What were the year over year revenue trends in the period, and what drove that change",
    response_mode='tree_summarize')
print(response1)
responses.append(response1.response)

response2 = index.query(
    "What were the product or regional differences that drove revenue changes",
    response_mode='tree_summarize')
print(response2)
responses.append(response2.response)

response_strs = [str(r) for r in responses]
index2 = GPTListIndex(response_strs, llm_predictor=llm_predictor,
                      prompt_helper=prompt_helper)
index2.query("Summarize in less than 150 words.")
Do response_strs = [Document(str(r)) for r in responses]
instead (Document can be imported from gpt_index)
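Putting it together, a minimal end-to-end sketch of the whole flow (this assumes Document, GPTListIndex, and SimpleDirectoryReader are all importable from the top-level gpt_index package, and that llm_predictor and prompt_helper are defined as in your snippet):

from gpt_index import Document, GPTListIndex, SimpleDirectoryReader

# Build a list index over the source documents.
documents = SimpleDirectoryReader('data').load_data()
index = GPTListIndex(documents, llm_predictor=llm_predictor,
                     prompt_helper=prompt_helper)

questions = [
    "What were the year over year revenue trends in the period, and what drove that change",
    "What were the product or regional differences that drove revenue changes",
]

# Ask each question, then wrap each answer in a Document so that
# GPTListIndex accepts it (plain strings raise the ValueError above).
responses = [index.query(q, response_mode='tree_summarize') for q in questions]
answer_docs = [Document(str(r)) for r in responses]

# Build a second list index over the answers for the final summary.
index2 = GPTListIndex(answer_docs, llm_predictor=llm_predictor,
                      prompt_helper=prompt_helper)
print(index2.query("Summarize in less than 150 words."))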
@jerryjliu0 if you don't mind, can you please help me out with the following? I am trying to input an SEC filing, specifically a 10-Q document like this one:
https://www.sec.gov/Archives/edgar/data/1045810/000104581022000147/nvda-20220731.htm
I am currently using a GPT list index with an embedding query, "tree_summarize" as the response mode, and similarity_top_k=10. I am asking 4 to 5 questions about the document and storing the responses in an array. I then pass this array to a new list index to summarize all the answers. The final summary is sometimes good, but it is not consistent across other companies' filings; it misses some key points from the answers to the queries. So I want to know: is a list index a good approach here, or is there a better method? Thanks
@jerryjliu0 gentle reminder on this
@mrmvp oops missed this. are you using the list index? if so similarity_top_k shouldn't be a valid parameter (that's only for vector store indices). list index is in general good for summarization queries (if you want to go through the entire set of documents).
What is the query that you're passing in? Some of the output quality might depend on the query prompt you're passing in
No problem. Here are the questions I am trying to get answers for. I am also focusing on just one section of the 10-Q document, the "Management's Discussion and Analysis":
What were the revenue trends year over year in the period, and what drove that change?
What were the cost challenges or benefits, both in terms of gross margins as well as operating costs?
What were the strategic decisions in the period, including any of M&A, investment, product and growth initiatives?
What were the trends in cashflow and working capital?
Overall, are management more or less confident about the prospects of the company?
If I am using mode="embedding" with a list index, can I not use similarity_top_k?
@mrmvp oops my bad, yes you can use similarity_top_k for mode="embedding"
You could try increasing the similarity_top_k for more detailed responses. Doing summarization queries purely with embedding-based retrieval can give mixed results. Of course, having the list index process more chunks can also be expensive
Let's take the first question as an example. I am using a value of 10 for similarity_top_k, and it still misses the part of the text where the year over year trend was discussed. Does this happen because of using embedding mode? Is there a better way to do this?
if you're not super concerned about cost, you could try mode="default" with response_mode="tree_summarize", it'll go through every node in the document
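e.g. something like this (a sketch reusing your first question):

response = index.query(
    "What were the year over year revenue trends in the period, and what drove that change",
    mode="default",
    response_mode="tree_summarize")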
Yeah let me try that and see how it goes. Thank you
you can also set required_keywords=["keyword1", "keyword2"]
during the index.query call to only keep chunks that contain those keywords
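e.g. (a sketch; the keyword is just a placeholder to illustrate the parameter):

response = index.query(
    "What were the year over year revenue trends in the period, and what drove that change",
    required_keywords=["revenue"],
    response_mode="tree_summarize")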
Hey @jerryjliu0, I have modified my prompt and it started to give better results, but for some documents I still get "it is not possible to answer this question with the given context". Take a look at this log:
Searching in chunk: is to first fund operations and investments in ...
Building index from nodes: 4 chunks
0/45, summary:
The revenue decreased in the third quarter of ...
10/45, summary:
The revenue decreased by 9.3% in the third qua...
20/45, summary:
The increase in revenue in the first nine mont...
30/45, summary:
In the third quarter and first nine months of ...
40/45, summary:
The year over year revenue trends in percentag...
Initial response:
The year over year revenue trends in percentage and what drove the revenue change cannot be determined from the given context information.
[query] Total LLM token usage: 13419 tokens
[query] Total embedding token usage: 11840 tokens
The year over year revenue trends in percentage and what drove the revenue change cannot be determined from the given context information
It is able to figure out the nodes which contain the revenue trends, but it still returns "it is not possible to answer this question". Do you have any thoughts on this?
@mrmvp are you still using the vector index?
I am using the list index with mode="embedding" and response_mode="tree_summarize"
can you try mode="default"?
or increasing similarity_top_k for mode="embedding"
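e.g. either of these (sketches; query_str and the top-k value are placeholders):

# go through every node instead of retrieving by embedding
response = index.query(query_str, mode="default",
                       response_mode="tree_summarize")

# or keep embeddings but retrieve more chunks
response = index.query(query_str, mode="embedding", similarity_top_k=20,
                       response_mode="tree_summarize")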
@mrmvp were you able to solve your issue?