Updated 2 days ago

I'm getting some weird behaviour from

At a glance

The community member gets multiple document objects when loading a single file with SimpleDirectoryReader() and a LlamaParse parser as the PDF extractor, and asks whether this is intentional. The comments point to the split_by_page parameter of LlamaParse(), which defaults to True; setting it to False avoids the per-page split. A second member who still got one document per page with split_by_page=False traced it to a typo in their file_extractor.

I'm getting some weird behaviour from SimpleDirectoryReader() with llamaparse and wondering if it's intentional. When I load just one file I am ending up with multiple document objects.

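# Import paths as of llama-index >= 0.10; older versions may differ.
from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse

parser = LlamaParse(
    result_type="markdown",
    verbose=True,
)
file_extractor = {".pdf": parser}
document = SimpleDirectoryReader(
    input_files=[pdf_path],  # pdf_path is ONE file path, e.g. './easy_data/example_file.pdf'
    file_extractor=file_extractor,
    filename_as_id=True,
).load_data(show_progress=True)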

However, when I run len(document) I am getting a number > 1, which doesn't make sense. Any ideas what's going on?
6 comments
split_by_page defaults to true
LlamaParse(..., split_by_page=False)
will avoid that
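One way to check whether the extra documents really are per-page splits is to count documents per source file. This is a minimal sketch, assuming each loaded document carries a `file_name` entry in its `metadata` dict (verify this on your version); `FakeDoc` and `docs_per_file` are hypothetical names, and `FakeDoc` is only a stand-in so the snippet runs without llama_index installed.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a loaded document (text plus a metadata dict)."""
    text: str
    metadata: dict = field(default_factory=dict)


def docs_per_file(documents):
    """Count how many document objects came from each source file."""
    return Counter(doc.metadata.get("file_name", "<unknown>") for doc in documents)


# Simulates a 3-page PDF that was split into one document per page.
docs = [FakeDoc(f"page {i}", {"file_name": "example_file.pdf"}) for i in range(3)]
counts = docs_per_file(docs)
print(counts)
```

With real data, pass the list returned by load_data() instead of `docs`. A count above 1 for a single file suggests the per-page split is still active.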
Any update on this?
I still see the same behaviour with the latest version of LlamaParse:
parser = LlamaParse(
    api_key=os.environ.get('LLAMA_API_KEY'),
    result_type="markdown",
    split_by_page=False,
    num_workers=4,
    verbose=True,
    language="en",
)

documents = SimpleDirectoryReader(
    input_files=input_files,
    exclude_empty=True,
    filename_as_id=True,
    file_extractor=file_extractor,
).load_data(show_progress=True)

but it creates one doc per page of pdf...
@Logan M Thanks, I found the problem: I had a typo in the file_extractor 🙄...
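The thread does not say what the typo was. One plausible variant is an extractor key that does not match the file suffix, in which case the custom parser would not be applied to that file. The sketch below checks for that before loading. It assumes the extractor is looked up by lowercased suffix with a leading dot, as in the original post's `{".pdf": parser}`; `unmatched_suffixes` is a hypothetical helper, not part of any library.

```python
from pathlib import Path


def unmatched_suffixes(input_files, file_extractor):
    """Return suffixes of the input files that have no key in file_extractor."""
    suffixes = {Path(f).suffix.lower() for f in input_files}
    return sorted(suffixes - set(file_extractor))


parser = object()  # placeholder for a LlamaParse instance
good = {".pdf": parser}
typo = {"pdf": parser}  # missing leading dot, so ".pdf" files would not match

files = ["./easy_data/example_file.pdf"]
missing_good = unmatched_suffixes(files, good)
missing_typo = unmatched_suffixes(files, typo)
print(missing_good, missing_typo)
```

A non-empty result means at least one input file would fall back to a reader other than the one you configured.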