
Updated 4 months ago

PDF reader

At a glance

A community member is upgrading an app from llama-index==0.5.40 to llama-index==0.6.11. With 0.5.40, the entire contents of a 3-page test PDF are added to the index; with 0.6.11, only the first page is. They have looked through the documentation and are asking for hints or suggestions. Another community member suggests that the num_pages value in the PDF reader may be wrong, and recommends running the PDF reader code outside of llama-index to find where the problem lies.

I'm upgrading an app that uses llama-index==0.5.40 to llama-index==0.6.11. Using the example below, the entire contents of a 3-page test PDF are added to the index in 0.5.40, but only the first page in 0.6.11. I've looked through the docs, but any hints or suggestions are welcome. With the upgrade I also moved from PyPDF2==3.0.1 to pypdf==3.9.0.
Plain Text
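from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader
# service_context and doc_text (the path to the PDF) are defined elsewhere in the app
index = GPTVectorStoreIndex([], service_context=service_context)
# load_data() returns a list; [0] keeps only its first Document
document = SimpleDirectoryReader(input_files=[doc_text]).load_data()[0]
index.insert(document)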
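One thing that may be worth checking in the snippet above is how many documents load_data() returns, since only element [0] is inserted. If the newer reader happens to return one Document per page (an assumption that is not confirmed in this thread), the remaining pages would never reach the index. A minimal sketch, where insert_all is a hypothetical helper and not part of llama-index:

```python
def insert_all(index, documents):
    """Insert every loaded document into the index, not just documents[0].

    `index` is any object with an insert(document) method, such as the
    GPTVectorStoreIndex from the post. Returns the number of documents inserted.
    """
    for document in documents:
        index.insert(document)
    return len(documents)


# Usage with the names from the post (requires llama-index to be installed):
# documents = SimpleDirectoryReader(input_files=[doc_text]).load_data()
# print(len(documents))  # more than 1 would mean [0] drops the other pages
# insert_all(index, documents)
```

If len(documents) is 1 and the text is still truncated, the cause is more likely in the PDF text extraction itself.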
2 comments
The code for the pdf reader is quite simple...
https://github.com/jerryjliu/llama_index/blob/4e29d1e7a2c55a031bebd1e69c51aebfa2cfdd61/llama_index/readers/file/docs_reader.py#L16

Maybe num_pages is somehow not correct? You could run this code outside of llama-index to see where the issue is with your PDF.
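A minimal sketch of such a standalone check, assuming pypdf is installed as in the original post. It does not reproduce the linked file; it only reads the page count and extracts text page by page so that num_pages and the per-page output can be inspected directly. The names extract_page_texts and report_pdf are hypothetical.

```python
from typing import List


def extract_page_texts(reader) -> List[str]:
    """Return the extracted text of every page of a pypdf.PdfReader-like object."""
    num_pages = len(reader.pages)
    # Fall back to an empty string if a page yields no text.
    return [reader.pages[i].extract_text() or "" for i in range(num_pages)]


def report_pdf(path: str) -> List[str]:
    """Open a PDF with pypdf and print the character count for each page."""
    from pypdf import PdfReader  # imported here so the helper above has no dependency

    with open(path, "rb") as fp:
        reader = PdfReader(fp)
        texts = extract_page_texts(reader)
    print(f"num_pages = {len(texts)}")
    for number, text in enumerate(texts, start=1):
        print(f"page {number}: {len(text)} characters")
    return texts


# Usage (replace the path with the 3-page test PDF):
# report_pdf("test.pdf")
```

If this reports three pages with non-empty text, the PDF parsing is probably fine and the difference is more likely in how the loaded documents are passed to the index.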
Thank you @Logan M