
Updated 6 months ago

Hi, why does the SimpleDirectoryReader

At a glance

The community members discuss the behavior of the SimpleDirectoryReader when loading PDF files. The original post asks why the reader loads a PDF as multiple documents, one per page, and whether it's possible to load it as a single document.

In the comments, a community member explains that this is done to have metadata like page_label for each page. They suggest that the user can modify the PDFReader to create a single document for each PDF.

Another community member provides specific code examples on how to set up the LlamaParse or the default PDFReader to load a PDF as a single document, by setting the split_by_page or return_full_document options, respectively.

Hi, why does the SimpleDirectoryReader load a PDF as many documents, one per page? Is it possible to make it load the PDF as a single document?
2 comments
This is done so that each page carries its own metadata, such as page_label.

You can also modify the PDFReader to suit your requirements and have it create a single document for each PDF.
Yes, and just to add to that, here's how you set up LlamaParse or the default PDFReader to do it:

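from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader

# split_by_page=False returns each PDF as one document instead of one per page
parser = LlamaParse(split_by_page=False)
# Route every .pdf file in the directory through this parser
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()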


The default PDFReader:

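from llama_index.readers.file import PDFReader
from llama_index.core import SimpleDirectoryReader

# return_full_document=True returns each PDF as one document instead of one per page
reader = PDFReader(return_full_document=True)
# Route every .pdf file in the directory through this reader
file_extractor = {".pdf": reader}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()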
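
If you would rather keep the default per-page loading and combine the pages afterwards, something like the following might work. This is only a rough sketch: `Page` is a hypothetical stand-in for a per-page document (it is not a llama_index class), `merge_pages` is an illustrative helper, and it assumes each page's metadata has a `file_name` key and, optionally, a `page_label` key.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a per-page document (text plus metadata)."""
    text: str
    metadata: dict = field(default_factory=dict)


def merge_pages(pages, separator="\n\n"):
    """Combine per-page records into one record per source file."""
    # Group pages by the file they came from, preserving input order
    grouped = defaultdict(list)
    for page in pages:
        grouped[page.metadata["file_name"]].append(page)

    merged = []
    for file_name, file_pages in grouped.items():
        merged.append(
            Page(
                text=separator.join(p.text for p in file_pages),
                metadata={
                    "file_name": file_name,
                    # Keep the page labels so they are not lost in the merge
                    "page_labels": [p.metadata.get("page_label") for p in file_pages],
                },
            )
        )
    return merged


pages = [
    Page("Intro", {"file_name": "a.pdf", "page_label": "1"}),
    Page("Details", {"file_name": "a.pdf", "page_label": "2"}),
    Page("Other", {"file_name": "b.pdf", "page_label": "1"}),
]
merged = merge_pages(pages)
```

Merging after loading keeps the page labels around as a list, which the single-document options above do not necessarily preserve.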