
Updated 2 years ago

When I read in a document in markdown

When I load a single document in markdown format (originally an annual report in .pdf format) using the following call, it gets turned into ~100 documents:
documents = SimpleDirectoryReader(directory).load_data()

Any idea why this is happening? Some of the documents end up being two words long; others end up being 100 words long.
@confused_skelly
  1. I'm on the following version: 0.5.27
  2. Indeed, I only pass a single markdown file.
Here's an example file that leads to 99 documents
I think the markdown loader splits each header/section in the markdown file into its own document.
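To illustrate why header-based splitting would produce so many tiny documents, here is a rough sketch of that kind of splitting. This is not the library's actual code; `split_markdown_by_header` is a hypothetical helper written only for this example, and it assumes ATX-style headers (lines starting with `#`).

```python
import re


def split_markdown_by_header(text):
    """Split markdown text into (header, body) pairs, one per header.

    Hypothetical illustration only; the real loader may behave differently.
    """
    sections = []
    header = None
    body_lines = []

    def flush():
        body = "\n".join(body_lines).strip()
        # Skip the empty preamble that precedes the first header.
        if header is not None or body:
            sections.append((header, body))

    for line in text.splitlines():
        if re.match(r"^#+\s", line):
            flush()  # close the previous section before starting a new one
            header = line.lstrip("#").strip()
            body_lines = []
        else:
            body_lines.append(line)
    flush()  # close the final section
    return sections


sample = "# Annual Report\n\nIntro text.\n\n## Revenue\n\nUp.\n\n## Outlook\n"
sections = split_markdown_by_header(sample)
print(len(sections))  # one entry per header in the file
```

A report converted from PDF tends to have many short headings, so a splitter like this would yield many sections, some with only a couple of words in the body.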
@yaya90 is there a reason you've converted the pdf to markdown?
But a simple way to "fix" this issue is to use something like Unstructured.io to read the markdown into a single string,
and then pass that string into a Document object, as shown in the second workflow at this link:
https://gpt-index.readthedocs.io/en/latest/guides/primer/usage_pattern.html#load-in-documents
You don't have to use Unstructured btw, it was just the first thing that came to mind when you said PDF.
It does a pretty bang-up job reading PDFs and even runs OCR when it can't find text.
8/10 would recommend
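A minimal sketch of the workaround described above, using only the standard library to read the file (no Unstructured). The `read_markdown_as_text` helper and the `report.md` file name are hypothetical; the commented `Document` lines assume the import path and constructor shown in the linked guide for this version, so check them against your install.

```python
from pathlib import Path


def read_markdown_as_text(path):
    """Read the whole markdown file into one string, with no splitting."""
    return Path(path).read_text(encoding="utf-8")


# Assumed usage with llama_index (not run here; verify for your version):
#
#   from llama_index import Document
#   text = read_markdown_as_text("report.md")  # hypothetical file name
#   documents = [Document(text)]               # one document for the whole file
```

With this approach the file stays as a single document at load time, and any chunking happens later when the index is built.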
Appreciate the help! Is there any way for the markdown loader not to split each header/section in markdown into its own document?
Hmm I don't think so. But also, maybe I'm misunderstanding how it works

The source code is a little confusing haha and I'm just on my phone at the moment

https://github.com/jerryjliu/llama_index/blob/main/gpt_index/readers/file/markdown_parser.py