Pdfs

At a glance

The community member is working on an MVP (Minimum Viable Product) that deals with cases before a human rights board. They have two cases in a 'cases/' directory, one 12 pages long and the other 7 pages long. The community member is using a SimpleDirectoryReader to load the PDFs, but each page is being loaded as a separate Document object. This is causing issues with GPT-4, as it is mixing facts from the different cases and providing incorrect answers. The community member believes they will get better results if each case is its own Document object. They are looking for a way to load an entire PDF as a single Document object. In the comments, another community member suggests loading each PDF individually and then combining all the text into a single Document object, which the original poster says they will look into.

BBP

hey guys, my mvp deals with cases (in law) before a human rights board.

I have two cases in my 'cases/' directory: the first is 12 pages long and the second is 7. I noticed that the simple loader here:


reader = SimpleDirectoryReader(
    input_dir="cases/"
)

documents = reader.load_data()

is loading the PDFs into a list of Document objects, but the thing is, it's loading each page as a separate Document object. I turn each Document object into a node and put all 19 nodes into my vector store.
Unfortunately, GPT-4 is mixing facts from the two cases and giving wrong answers.

I think I'd get better results if each case were its own Document object, and subsequently its own Node. Does one of the default loaders have the ability to load an entire PDF as one Document object?
I swear I watched a tutorial on this but I've been looking for it and can't find it for the life of me. Send halp please 🙏 ❤️

2 comments

Logan M

You could just load each pdf (rather than the directory), and then combine all the text into a single document object 🤔
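A rough sketch of the suggestion above, not a confirmed recipe: it groups page-level documents by their source file and joins the text, so each case ends up as one string. The helper name `merge_pages_by_file` is made up for illustration. The sketch assumes each page object exposes a `.text` string and a `.metadata` dict with a `file_name` key. That may or may not match what your LlamaIndex version returns, so check one loaded page first. The `SimpleNamespace` objects are stand-ins for real loader output.

```python
from collections import defaultdict
from types import SimpleNamespace


def merge_pages_by_file(page_docs, key="file_name", sep="\n\n"):
    """Group page-level documents by source file and join their text.

    Hypothetical helper: each item only needs a `.text` string and a
    `.metadata` dict containing `key` (assumed here to be "file_name").
    Pages are joined in the order they arrive, so they stay in reading
    order only if the loader returned them that way.
    """
    pages = defaultdict(list)
    for doc in page_docs:
        pages[doc.metadata[key]].append(doc.text)
    return {name: sep.join(texts) for name, texts in pages.items()}


# Stand-ins for the per-page objects a loader returns (made-up data).
page_docs = [
    SimpleNamespace(text="Case A, page 1", metadata={"file_name": "case_a.pdf"}),
    SimpleNamespace(text="Case A, page 2", metadata={"file_name": "case_a.pdf"}),
    SimpleNamespace(text="Case B, page 1", metadata={"file_name": "case_b.pdf"}),
]

merged = merge_pages_by_file(page_docs)
print(len(merged))  # 2: one combined text per PDF
```

Each combined string could then be wrapped in its own Document, for example `Document(text=combined, metadata={"file_name": name})`, assuming your installed version accepts those keyword arguments. Note that a node parser may still split a long Document into several chunks, so keeping the file name in the metadata gives you something to filter on at query time.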

BBP

thank you! I'll look into this 🫡
