Pdf reader

At a glance

A community member is upgrading an app from llama-index==0.5.40 to llama-index==0.6.11 (and from PyPDF2==3.0.1 to pypdf==3.9.0). With 0.5.40, the entire contents of a 3-page test PDF are added to the index; with 0.6.11, only the first page is. They have looked through the documentation and are asking for hints. Another community member links to the PDF reader source, suggests that the num_pages value may be wrong, and recommends running the reader code outside of llama-index to see where the problem lies.

eelmegatan26

I'm upgrading an app that uses llama-index==0.5.40 to llama-index==0.6.11. Using the example below, the entire contents of a 3-page test PDF are added to the index in 0.5.40, but only the first page in 0.6.11. I've looked through the docs, but any hints or suggestions are welcome. With the upgrade I also switched from PyPDF2==3.0.1 to pypdf==3.9.0.

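from llama_index import GPTVectorStoreIndex, SimpleDirectoryReader

# service_context and doc_text (the path to the PDF) are defined elsewhere in the app
index = GPTVectorStoreIndex([], service_context=service_context)
document = SimpleDirectoryReader(input_files=[doc_text]).load_data()[0]
index.insert(document)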

2 comments

Logan M

The code for the pdf reader is quite simple...
https://github.com/jerryjliu/llama_index/blob/4e29d1e7a2c55a031bebd1e69c51aebfa2cfdd61/llama_index/readers/file/docs_reader.py#L16

Maybe num_pages is somehow not correct? You could use this code to test outside of llama-index and see where the issue is with your PDF.
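As a rough illustration of that suggestion, here is a minimal sketch that reads a PDF with pypdf directly and reports how many pages it sees and how much text comes out of each one. It is not the llama-index reader itself. The helper names page_texts and report, and the path test.pdf, are made up for this example; only PdfReader, .pages, and extract_text() come from pypdf.

```python
from typing import List


def page_texts(reader) -> List[str]:
    """Return the extracted text of every page, in order.

    `reader` is expected to be a pypdf.PdfReader, or any object exposing
    `.pages` whose items have an `extract_text()` method.
    """
    texts = []
    for page in reader.pages:
        # Fall back to an empty string if a page yields no text
        # (for example, a scanned, image-only page).
        texts.append(page.extract_text() or "")
    return texts


def report(path: str) -> None:
    """Print the page count and a short preview of each page's text."""
    # Imported here so page_texts can be exercised without pypdf installed.
    from pypdf import PdfReader

    texts = page_texts(PdfReader(path))
    print(f"num_pages = {len(texts)}")
    for number, text in enumerate(texts, start=1):
        print(f"page {number}: {len(text)} chars, starts with {text[:60]!r}")


# Example call (hypothetical path to the 3-page test PDF):
# report("test.pdf")
```

If this reports three pages with text on each, pypdf is reading the file correctly and the difference is more likely in how the loaded documents are used. One thing worth checking, as an assumption not confirmed in this thread, is how many items load_data() returns in 0.6.11: if it returns one document per page, then taking [0] would keep only the first page.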

eelmegatan26

Thank you @Logan M
