Is there some reason querying a `GPTListIndex` built from four individual PDF documents with `PDFReader()` would execute extremely slowly and inefficiently compared to the same query against a `GPTSimpleVectorIndex` built on the same four PDFs loaded with a `SimpleDirectoryReader`?
Explanation: I built an index using `SimpleDirectoryReader`:
```python
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader('memories/book_club/').load_data()
index = GPTSimpleVectorIndex.from_documents(documents)
response1 = index.query("Who is your papa, and what does he do?")  # It's a long story
```
After indexing, the query executes in about a second and costs almost nothing. However, I wanted to be able to track the filenames that results are returned from, so I built a `GPTListIndex` from the files themselves:
```python
from pathlib import Path
from llama_index import GPTListIndex, download_loader

book_list = ['memories/book_club/pdf1.pdf', 'memories/book_club/pdf2.pdf',
             'memories/book_club/pdf3.pdf', 'memories/book_club/pdf4.pdf']

# Load each PDF separately so its filename can be attached as extra_info
llama_pdf_reader = download_loader("PDFReader")
llama_pdf_loader = llama_pdf_reader()
pdf_documents = [llama_pdf_loader.load_data(file=Path(book), extra_info={"filename": book})
                 for book in book_list]

# load_data() returns a list per file; take the first Document from each
index2 = GPTListIndex.from_documents([pdf_document[0] for pdf_document in pdf_documents])
response2 = index2.query("Who is your papa, and what does he do?")
```
This cell ran for 30 minutes and 5 seconds without completing before I interrupted the kernel. By then it had made 214 OpenAI text-davinci API calls, costing about $11.
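For context, the only reason for the `GPTListIndex` detour was to read the filename back out of the response. Had the query finished, I expected to do something roughly like the sketch below (the exact attribute names on the source nodes are my assumption and may vary between llama_index versions):

```python
# Sketch, not tested: inspect which file each part of the answer came from.
# I'm assuming each source node exposes the extra_info dict attached at load time.
for source_node in response2.source_nodes:
    print(source_node.extra_info)  # e.g. {'filename': 'memories/book_club/pdf3.pdf'}
```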
I'm sure I did something wrong, but these are the same four source documents, so I don't understand why the `GPTListIndex` query is so much slower and more expensive than the `GPTSimpleVectorIndex` query.
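If it helps clarify what I'm after: ideally I would keep the fast `GPTSimpleVectorIndex` and still get filenames back. My (possibly wrong) understanding is that `SimpleDirectoryReader` accepts a `file_metadata` callable for this; the sketch below shows what I mean, where `index3` and `response3` are just illustrative names.

```python
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Assumption: file_metadata is called with each file's path, and its return value
# is attached to the resulting Document as extra_info.
documents = SimpleDirectoryReader(
    'memories/book_club/',
    file_metadata=lambda filename: {"filename": filename},
).load_data()

index3 = GPTSimpleVectorIndex.from_documents(documents)
response3 = index3.query("Who is your papa, and what does he do?")
for source_node in response3.source_nodes:
    print(source_node.extra_info)  # hopefully includes the filename
```

If that approach is valid it would solve the filename problem, but I'd still like to understand why the two index types behave so differently on the same documents.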