Updated 2 years ago

When I read in a document in markdown

When I read in a document in markdown format (originally an annual report in .pdf format) using the following, it turns it into ~100 documents.
documents = SimpleDirectoryReader(directory).load_data()

Any idea why this is happening? Some of the documents end up being only two words long; others end up being around 100 words.

13 comments

yaya90

@confused_skelly

I'm on the following version: 0.5.27
Indeed, I only pass a single markdown file.

yaya90

Here's an example file that leads to 99 documents

confused_skelly

Let me load it up

Logan M

I think the markdown loader splits each header/section in markdown into its own document
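To illustrate why header-based splitting would give so many tiny documents, here is a rough sketch in plain Python. It is not the library's actual parser: `split_by_headers` is a hypothetical helper, and it assumes that every line starting with one or more `#` characters opens a new section (it ignores edge cases such as `#` lines inside code fences).

```python
import re
from typing import List, Optional, Tuple

# Assumption: a header is any line that starts with one or more '#' plus whitespace.
HEADER_RE = re.compile(r"^#+\s")


def split_by_headers(markdown_text: str) -> List[Tuple[Optional[str], str]]:
    """Split markdown into (header, body) pairs, one pair per header."""
    sections: List[Tuple[Optional[str], str]] = []
    header: Optional[str] = None
    body: List[str] = []

    def flush() -> None:
        # Emit the section collected so far, skipping an empty preamble.
        if header is not None or any(line.strip() for line in body):
            sections.append((header, "\n".join(body).strip()))

    for line in markdown_text.splitlines():
        if HEADER_RE.match(line):
            flush()
            header, body = line.lstrip("#").strip(), []
        else:
            body.append(line)
    flush()
    return sections


sample = "# Annual Report\n\n## Overview\nRevenue grew.\n\n## Outlook\nStable.\n"
for section_header, section_body in split_by_headers(sample):
    print(repr(section_header), repr(section_body))
```

A report converted from PDF tends to have many short headings, so a title with nothing under it ends up as a "document" of just a couple of words.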

confused_skelly

Ah that's it then

confused_skelly

@yaya90 is there a reason you've converted the PDF to markdown?

confused_skelly

But a simple way to "fix" this issue is to use something like Unstructured.io to read the markdown into a string

confused_skelly

and then pass the string into a Document object, as shown in the second workflow on this link:
https://gpt-index.readthedocs.io/en/latest/guides/primer/usage_pattern.html#load-in-documents
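A minimal sketch of that workaround without any extra dependency, assuming the file is plain UTF-8 markdown. `read_markdown_as_text` and the `report.md` path are made-up names for illustration, and the `Document` usage in the comment follows the pattern in the docs linked above, so double-check it against your installed version.

```python
from pathlib import Path


def read_markdown_as_text(path: str) -> str:
    """Read the whole markdown file as one string, with no per-header splitting."""
    return Path(path).read_text(encoding="utf-8")


# The string can then be wrapped in a single Document instead of ~100, e.g.:
#   from llama_index import Document
#   documents = [Document(read_markdown_as_text("report.md"))]
# (assumed to match the 0.5.x usage pattern; "report.md" is a placeholder path)
```

The index then does its own chunking on that one document, rather than inheriting one chunk per header.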

confused_skelly

you don't have to use Unstructured btw, it was just the first thing that came to mind when you said PDF

confused_skelly

It does a pretty bang-up job reading PDFs and even falls back to OCR when it can't find text

confused_skelly

8/10 would recommend

yaya90

Appreciate the help! Is there any way for the markdown loader not to split each header/section in markdown into its own document?

Logan M

Hmm I don't think so. But also, maybe I'm misunderstanding how it works

The source code is a little confusing haha and I'm just on my phone at the moment

https://github.com/jerryjliu/llama_index/blob/main/gpt_index/readers/file/markdown_parser.py