I'm getting some weird behaviour from

Question

I'm getting some weird behaviour from SimpleDirectoryReader() with LlamaParse and wondering if it's intentional. When I load just one file, I end up with multiple Document objects.

parser = LlamaParse(
    result_type="markdown",
    verbose=True,
)
file_extractor = {".pdf": parser}
# pdf_path is ONE file path, e.g. './easy_data/example_file.pdf'
document = SimpleDirectoryReader(
    input_files=[pdf_path],
    file_extractor=file_extractor,
    filename_as_id=True,
).load_data(show_progress=True)

However, when I run len(document) I get a number > 1, which doesn't make sense. Any ideas what's going on?

Logan M · Answer

split_by_page defaults to true

Logan M · Answer

LlamaParse(..., split_by_page=False)

Logan M · Answer

will avoid that

syolbe · Answer

Any update on this?
I still see the same behaviour with the latest version of LlamaParse:

parser = LlamaParse(
    api_key=os.environ.get('LLAMA_API_KEY'),
    result_type="markdown",
    split_by_page=False,
    num_workers=4,
    verbose=True,
    language="en",
)

documents = SimpleDirectoryReader(
    input_files=input_files,
    exclude_empty=True,
    filename_as_id=True,
    file_extractor=file_extractor,
).load_data(show_progress=True)

but it still creates one document per page of the PDF...

Logan M · Answer

@syolbe works fine for me: https://colab.research.google.com/drive/1974k5nSGOF4BrbvdsAG0sRX2fqAH_ByS?usp=sharing

syolbe · Answer

@Logan M Thanks, I found the problem: I had a typo in the file_extractor 🙄...
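For anyone hitting the same thing: the thread doesn't say what the typo actually was, so the following is only a guess at one way it can happen. As far as I understand, SimpleDirectoryReader looks up each file's lowercased suffix (including the leading dot) in file_extractor, and if nothing matches it quietly falls back to its default reader, which for PDFs yields one document per page. A small sanity check, using a hypothetical helper named unmatched_extensions (not part of LlamaIndex) and a stand-in object in place of a real LlamaParse instance, might look like this:

```python
from pathlib import Path


def unmatched_extensions(input_files, file_extractor):
    """Return the file suffixes that have no exact key in file_extractor.

    Hypothetical helper for illustration; it only compares dictionary keys
    against lowercased suffixes such as '.pdf'.
    """
    suffixes = {Path(f).suffix.lower() for f in input_files}
    return sorted(s for s in suffixes if s not in file_extractor)


# Stand-in for a LlamaParse instance so the check runs without an API key.
parser = object()

good = {".pdf": parser}
typo = {"pdf": parser}  # assumed typo: missing leading dot

files = ["./easy_data/example_file.pdf"]
print(unmatched_extensions(files, good))  # []
print(unmatched_extensions(files, typo))  # ['.pdf']
```

If the list is non-empty, the custom parser is probably not being used for those files, so settings like split_by_page=False on it would have no effect.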
