Is there some reason querying a GPTListIndex built from four individual PDF documents with PDFReader() would execute extremely slowly and inefficiently compared to the same query against a GPTSimpleVectorIndex built on the same four PDFs via a SimpleDirectoryReader?

Explanation: I built an index using SimpleDirectoryReader:

Plain Text
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex

documents = SimpleDirectoryReader('memories/book_club/').load_data()
index = GPTSimpleVectorIndex.from_documents(documents)
response1 = index.query("Who is your papa, and what does he do?")  # It's a long story


After indexing, the search executes in about a second and costs nearly nothing. However, I wanted to be able to track the filenames that results are returning from, so I built a GPTListIndex from the files themselves:

Plain Text
from pathlib import Path
from llama_index import GPTListIndex, download_loader

book_list = ['memories/book_club/pdf1.pdf', 'memories/book_club/pdf2.pdf',
             'memories/book_club/pdf3.pdf', 'memories/book_club/pdf4.pdf']
llama_pdf_reader = download_loader("PDFReader")
llama_pdf_loader = llama_pdf_reader()
pdf_documents = [llama_pdf_loader.load_data(file=Path(book),
                 extra_info={"filename": book}) for book in book_list]
index2 = GPTListIndex.from_documents([pdf_document[0] for pdf_document in pdf_documents])
response2 = index2.query("Who is your papa, and what does he do?")


This cell ran for 30m and 5s without completing before I interrupted the kernel. It made 214 OpenAI text-davinci API calls, for about $11.

I'm sure I did something wrong, but these are the same source documents.
The list index will query EVERY node to find the answer (this can be slow and use many tokens).

Whereas by default, the vector index only uses the single closest matching node to the query (you can also set this a bit higher, maybe top 3 or top 5).

Usually, a list index is better for summarization, or for queries where you need to check every node.
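The cost difference described above can be sketched in plain Python with toy embeddings (no llama_index calls; the numbers here are made up for illustration): a list index sends every node to the LLM, while a vector index ranks nodes by similarity to the query and only sends the top k.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "nodes": (chunk_text, embedding) pairs standing in for document chunks.
nodes = [
    ("chunk about papa", [0.9, 0.1]),
    ("chunk about weather", [0.1, 0.9]),
    ("chunk about papa's job", [0.8, 0.2]),
    ("unrelated chunk", [0.0, 1.0]),
]
query_emb = [1.0, 0.0]

# List-index behaviour: every node triggers an LLM call, so cost grows
# linearly with corpus size.
list_index_calls = len(nodes)

# Vector-index behaviour: rank by similarity, send only the top k.
top_k = 1
ranked = sorted(nodes, key=lambda n: cosine(n[1], query_emb), reverse=True)
vector_index_calls = len(ranked[:top_k])

print(list_index_calls)    # 4
print(vector_index_calls)  # 1
```

With 214 chunks, the list index makes 214 calls for one query, which matches the behaviour in the original question.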
Interesting. So if I want a synthesis from multiple sources, but I don't want to search every node, I guess I should make one GPTSimpleVectorIndex for each file, retrieve the closest matching node from each of those, and form a synthesis from the results?
Yea exactly. In a large majority of cases, a vector index will do fine.
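The per-file approach suggested above can be sketched the same way (toy embeddings and dot products, no llama_index calls): keep one index per file, take the single closest chunk from each, then hand the combined hits to one synthesis step.

```python
# Toy per-file "indexes": each file maps to (chunk_text, embedding) chunks.
# File names and chunk contents are invented for illustration.
files = {
    "pdf1.pdf": [("papa is a fisherman", [0.9, 0.1]),
                 ("weather notes", [0.1, 0.9])],
    "pdf2.pdf": [("papa works at sea", [0.8, 0.3]),
                 ("recipe chapter", [0.0, 1.0])],
}
query_emb = [1.0, 0.0]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Retrieve the single closest chunk from each per-file index; because the
# results are keyed by file, the source filename is tracked for free.
closest = {name: max(chunks, key=lambda c: dot(c[1], query_emb))[0]
           for name, chunks in files.items()}

# The per-file hits would then go into one final synthesis LLM call.
print(closest["pdf1.pdf"])  # papa is a fisherman
print(closest["pdf2.pdf"])  # papa works at sea
```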

response = index.query("my query", similarity_top_k=3, response_mode="compact")

You can adjust the top_k, and optionally set the response mode (compact stuffs as much text as possible into each LLM call, rather than making one call per node; this is extra helpful for response times if you decrease the chunk size).
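The effect of the compact mode can be sketched without the library: instead of one LLM call per retrieved chunk, chunks are greedily packed into as few prompts as fit a context budget. The function name and character budget below are made up for illustration.

```python
def pack_chunks(chunks, context_budget):
    """Greedily pack chunk texts into prompts of at most context_budget chars."""
    prompts, current = [], ""
    for chunk in chunks:
        if current and len(current) + len(chunk) > context_budget:
            prompts.append(current)
            current = ""
        current += chunk
    if current:
        prompts.append(current)
    return prompts

# Four retrieved chunks of 400 characters each.
chunks = ["a" * 400, "b" * 400, "c" * 400, "d" * 400]

# Default refine-style behaviour: one LLM call per chunk.
calls_per_chunk = len(chunks)

# Compact behaviour: pack chunks into a shared context window.
compact_prompts = pack_chunks(chunks, context_budget=1000)

print(calls_per_chunk)       # 4
print(len(compact_prompts))  # 2
```

Halving the chunk size doubles the number of chunks but not the number of calls, which is why compact helps most with small chunks.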

If you want to track filenames, you can do this

Plain Text
filename_fn = lambda filename: {'file_name': filename}
documents = SimpleDirectoryReader('./data_dir', file_metadata=filename_fn).load_data()


Then you can check the response object to see the sources
response.source_nodes[0].node.node_info
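The idea behind the `file_metadata` hook can be sketched in plain Python: a callable maps each file path to a metadata dict that gets attached to the loaded document. The loader function and dict shape below are a hypothetical stand-in, not the library's implementation.

```python
import os
import tempfile

# Stand-in for the file_metadata hook: path -> metadata dict.
filename_fn = lambda filename: {"file_name": os.path.basename(filename)}

def load_with_metadata(data_dir, file_metadata):
    """Load every file in data_dir, attaching metadata from the hook."""
    docs = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        with open(path) as f:
            docs.append({"text": f.read(), "extra_info": file_metadata(path)})
    return docs

# Demo on a throwaway directory with two fake "books".
with tempfile.TemporaryDirectory() as d:
    for name in ("pdf1.txt", "pdf2.txt"):
        with open(os.path.join(d, name), "w") as f:
            f.write(f"contents of {name}")
    documents = load_with_metadata(d, filename_fn)

print(documents[0]["extra_info"])  # {'file_name': 'pdf1.txt'}
```

Because the metadata rides along with each document, any node retrieved at query time can report which file it came from.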
Sorry, I'm throwing a ton of info at you haha
Dude you're good. I really appreciate it. Navigating the docs has been interesting πŸ˜‚
haha glad to help! I think the docs give a good "taste" of everything, but there are so many features and options it's hard to get everything across πŸ˜…
I get the sense that this is changing a lot too. I get errors running code from posts from like a month ago.
yea, around v0.5.0, some breaking changes were introduced (needed to better support features going forward). It's slowed down a little bit the past week or so thankfully haha
And of course you can set response_mode to "tree_summarize" when using a vector index.
I’ve found that it gives you a solid summary while having much lower cost and a much faster response.