AveSharia
Joined September 25, 2024
Is there some reason querying a GPTListIndex built from four individual PDF documents with PDFReader() would execute extremely slowly and expensively compared to the same query against a GPTSimpleVectorIndex built on the same four PDFs with SimpleDirectoryReader?

Explanation: I built an index using SimpleDirectoryReader:

Plain Text
from llama_index import SimpleDirectoryReader, GPTSimpleVectorIndex

documents = SimpleDirectoryReader('memories/book_club/').load_data()
index = GPTSimpleVectorIndex.from_documents(documents)
response1 = index.query("Who is your papa, and what does he do?")  # It's a long story

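(Aside: if the only goal is tracking which file each result came from, it may be possible to stay on the vector-index path. Many llama_index versions let SimpleDirectoryReader attach per-file metadata via a `file_metadata` callable; whether your installed version supports it is an assumption to verify. A minimal sketch of such a callback, applied by hand to show what each document would carry:)

```python
from pathlib import Path

# Hypothetical metadata hook: some llama_index versions accept
# SimpleDirectoryReader(..., file_metadata=<callable>) where the callable
# maps a file path to a metadata dict attached to each loaded document.
# Verify against the version you have installed.
def file_metadata(path: str) -> dict:
    return {"filename": Path(path).name}

# Applied manually to illustrate the metadata each document would carry:
paths = ["memories/book_club/pdf1.pdf", "memories/book_club/pdf3.pdf"]
metadata = [file_metadata(p) for p in paths]
# metadata[0] == {"filename": "pdf1.pdf"}
```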

After indexing, the query executes in about a second and costs nearly nothing. However, I wanted to be able to track the filenames that results come from, so I built a GPTListIndex from the files themselves:

Plain Text
from pathlib import Path
from llama_index import GPTListIndex, download_loader

book_list = ['memories/book_club/pdf1.pdf', 'memories/book_club/pdf2.pdf',
             'memories/book_club/pdf3.pdf', 'memories/book_club/pdf4.pdf']
llama_pdf_reader = download_loader("PDFReader")
llama_pdf_loader = llama_pdf_reader()
# load_data returns a list of Documents per file; only the first is kept below
pdf_documents = [llama_pdf_loader.load_data(file=Path(book),
                 extra_info={"filename": book}) for book in book_list]
index2 = GPTListIndex.from_documents([pdf_document[0] for pdf_document in pdf_documents])
response2 = index2.query("Who is your papa, and what does he do?")


This cell ran for 30m and 5s without completing before I interrupted the kernel. It made 214 OpenAI text-davinci API calls, for about $11.
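(For intuition on why the call counts diverge so sharply: with default response synthesis, a list index sends roughly every node to the LLM, so cost scales with the total number of chunks, while a vector index embeds the query and synthesizes only over the top-k retrieved chunks. A back-of-envelope sketch with illustrative, not measured, numbers:)

```python
# Rough cost model: a list index with create-and-refine synthesis makes
# roughly one LLM call per node; a vector index answers from top_k nodes.
# The "one call per node" figure is an approximation, not an exact count.
def llm_calls(num_nodes: int, index_type: str, top_k: int = 1) -> int:
    if index_type == "list":    # iterates over every node in the index
        return num_nodes
    if index_type == "vector":  # synthesizes only over retrieved nodes
        return top_k
    raise ValueError(f"unknown index type: {index_type}")

# Four PDFs chunked into ~214 nodes is consistent with the observed
# 214 text-davinci calls for the list index versus ~1 for the vector index.
assert llm_calls(214, "list") == 214
assert llm_calls(214, "vector", top_k=1) == 1
```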

I'm sure I did something wrong, but these are the same source documents. What could explain the difference?
10 comments