Is there some reason querying a GPTListIndex built from four individual PDF documents with PDFReader() would execute extremely slowly and inefficiently compared to the same query against a GPTSimpleVectorIndex built on the same four PDFs via a SimpleDirectoryReader?

Explanation: I built an index using SimpleDirectoryReader:

Plain Text
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex

documents = SimpleDirectoryReader('memories/book_club/').load_data()
index = GPTSimpleVectorIndex.from_documents(documents)
response1 = index.query("Who is your papa, and what does he do?")  # It's a long story


After indexing, the search executes in about a second and costs nearly nothing. However, I wanted to be able to track the filenames that results are returning from, so I built a GPTListIndex from the files themselves:

Plain Text
from pathlib import Path
from llama_index import GPTListIndex, download_loader

book_list = ['memories/book_club/pdf1.pdf', 'memories/book_club/pdf2.pdf',
             'memories/book_club/pdf3.pdf', 'memories/book_club/pdf4.pdf']
llama_pdf_reader = download_loader("PDFReader")
llama_pdf_loader = llama_pdf_reader()
pdf_documents = [llama_pdf_loader.load_data(file=Path(book),
                 extra_info={"filename": book}) for book in book_list]
index2 = GPTListIndex.from_documents([pdf_document[0] for pdf_document in pdf_documents])
response2 = index2.query("Who is your papa, and what does he do?")


This cell ran for 30m and 5s without completing before I interrupted the kernel. It made 214 OpenAI text-davinci API calls, for about $11.

I'm sure I did something wrong, but these are the same source documents.
The list index will query EVERY node to find the answer (this can be slow and use many tokens).

Whereas by default, the vector index only uses the single closest matching node to the query (you can also set this a bit higher, maybe top 3 or top 5).

Usually, a list index is better for summarization, or for queries where you need to check every node.
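The cost difference described above can be sketched in plain Python with toy embeddings (no llama_index calls; the numbers here are made up for illustration): a list index sends every node to the LLM, while a vector index ranks nodes by similarity to the query and only sends the top k.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "nodes": (chunk_text, embedding) pairs standing in for document chunks.
nodes = [
    ("chunk about papa", [0.9, 0.1]),
    ("chunk about weather", [0.1, 0.9]),
    ("chunk about papa's job", [0.8, 0.2]),
    ("unrelated chunk", [0.0, 1.0]),
]
query_emb = [1.0, 0.0]

# List-index behaviour: every node triggers an LLM call, so cost grows
# linearly with corpus size.
list_index_calls = len(nodes)

# Vector-index behaviour: rank by similarity, send only the top k.
top_k = 1
ranked = sorted(nodes, key=lambda n: cosine(n[1], query_emb), reverse=True)
vector_index_calls = len(ranked[:top_k])

print(list_index_calls)    # 4
print(vector_index_calls)  # 1
```

With 214 chunks, the list index makes 214 calls for one query, which matches the behaviour in the original question.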
Interesting. So if I want a synthesis from multiple sources, but I don't want to search every node, I guess I should make one GPTSimpleVectorIndex for each file, retrieve the closest matching node from each of those, and form a synthesis from the results?
Yea exactly. In a large majority of cases, a vector index will do fine.
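The per-file approach suggested above can be sketched the same way (toy embeddings and dot products, no llama_index calls): keep one index per file, take the single closest chunk from each, then hand the combined hits to one synthesis step.

```python
# Toy per-file "indexes": each file maps to (chunk_text, embedding) chunks.
# File names and chunk contents are invented for illustration.
files = {
    "pdf1.pdf": [("papa is a fisherman", [0.9, 0.1]),
                 ("weather notes", [0.1, 0.9])],
    "pdf2.pdf": [("papa works at sea", [0.8, 0.3]),
                 ("recipe chapter", [0.0, 1.0])],
}
query_emb = [1.0, 0.0]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Retrieve the single closest chunk from each per-file index; because the
# results are keyed by file, the source filename is tracked for free.
closest = {name: max(chunks, key=lambda c: dot(c[1], query_emb))[0]
           for name, chunks in files.items()}

# The per-file hits would then go into one final synthesis LLM call.
print(closest["pdf1.pdf"])  # papa is a fisherman
print(closest["pdf2.pdf"])  # papa works at sea
```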

response = index.query("my query", similarity_top_k=3, response_mode="compact")

You can adjust the top_k, and optionally set the response mode (compact stuffs as much text as possible into each LLM call, rather than making one call per node; this is extra helpful for response times if you decrease the chunk size).
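The effect of the compact mode can be sketched without the library: instead of one LLM call per retrieved chunk, chunks are greedily packed into as few prompts as fit a context budget. The function name and character budget below are made up for illustration.

```python
def pack_chunks(chunks, context_budget):
    """Greedily pack chunk texts into prompts of at most context_budget chars."""
    prompts, current = [], ""
    for chunk in chunks:
        if current and len(current) + len(chunk) > context_budget:
            prompts.append(current)
            current = ""
        current += chunk
    if current:
        prompts.append(current)
    return prompts

# Four retrieved chunks of 400 characters each.
chunks = ["a" * 400, "b" * 400, "c" * 400, "d" * 400]

# Default refine-style behaviour: one LLM call per chunk.
calls_per_chunk = len(chunks)

# Compact behaviour: pack chunks into a shared context window.
compact_prompts = pack_chunks(chunks, context_budget=1000)

print(calls_per_chunk)       # 4
print(len(compact_prompts))  # 2
```

Halving the chunk size doubles the number of chunks but not the number of calls, which is why compact helps most with small chunks.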

If you want to track filenames, you can do this

Plain Text
filename_fn = lambda filename: {'file_name': filename}
documents = SimpleDirectoryReader('./data_dir', file_metadata=filename_fn).load_data()


Then you can check the response object to see the sources
response.source_nodes[0].node.node_info
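The idea behind the `file_metadata` hook can be sketched in plain Python: a callable maps each file path to a metadata dict that gets attached to the loaded document. The loader function and dict shape below are a hypothetical stand-in, not the library's implementation.

```python
import os
import tempfile

# Stand-in for the file_metadata hook: path -> metadata dict.
filename_fn = lambda filename: {"file_name": os.path.basename(filename)}

def load_with_metadata(data_dir, file_metadata):
    """Load every file in data_dir, attaching metadata from the hook."""
    docs = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        with open(path) as f:
            docs.append({"text": f.read(), "extra_info": file_metadata(path)})
    return docs

# Demo on a throwaway directory with two fake "books".
with tempfile.TemporaryDirectory() as d:
    for name in ("pdf1.txt", "pdf2.txt"):
        with open(os.path.join(d, name), "w") as f:
            f.write(f"contents of {name}")
    documents = load_with_metadata(d, filename_fn)

print(documents[0]["extra_info"])  # {'file_name': 'pdf1.txt'}
```

Because the metadata rides along with each document, any node retrieved at query time can report which file it came from.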
Sorry, I'm throwing a ton of info at you haha
Dude you're good. I really appreciate it. Navigating the docs has been interesting πŸ˜‚
haha glad to help! I think the docs give a good "taste" of everything, but there are so many features and options it's hard to get everything across πŸ˜…
I get the sense that this is changing a lot too. I get errors running code from posts from like a month ago.
yea, around v0.5.0, some breaking changes were introduced (needed to better support features going forward). It's slowed down a little bit the past week or so thankfully haha
And of course you can set response_mode to "tree_summarize" when using a vector index.
I’ve found that it gives you a solid summary while having much lower cost and a much faster response.