Updated 3 months ago

Pdfs

Hey guys, my MVP deals with legal cases before a human rights board.

I have two cases in my 'cases/' directory: the first is 12 pages long and the second is 7. I notice that the simple loader here:

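# Import path for llama_index >= 0.10; older releases use `from llama_index import SimpleDirectoryReader`
from llama_index.core import SimpleDirectoryReader

# Read every file in the cases/ directory
reader = SimpleDirectoryReader(
    input_dir="cases/"
)

# For PDFs this returns one Document per page
documents = reader.load_data()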


is loading the PDFs into a list of Document objects, but it loads each page as its own Document object. I turn each Document object into a node and put all 19 nodes into my vector store.
Unfortunately, GPT-4 is mixing facts from the two cases and giving wrong answers.

I think I'd get better results if each case were its own Document object, and subsequently its own Node. Does one of the default loaders have the ability to load an entire PDF as one Document object?
I swear I watched a tutorial on this, but I've been looking for it and can't find it for the life of me. Send halp please 🙏 ❤️
2 comments
You could just load each PDF (rather than the directory), and then combine all the text into a single Document object 🤔
thank you! I'll look into this 🫡
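For anyone landing here later, a rough sketch of the "combine all the text" idea. This is one possible variant, not necessarily what was meant above: instead of loading each PDF separately, it groups the per-page documents by source file and joins their text. `PageDoc` and `merge_pages_by_file` are made-up names for illustration. `PageDoc` is just a stand-in for a per-page Document so the snippet runs without llama_index installed. It also assumes each page carries a `file_name` entry in its metadata, which recent SimpleDirectoryReader versions appear to set by default (worth checking on your version).

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a per-page llama_index Document."""
    text: str
    metadata: dict = field(default_factory=dict)


def merge_pages_by_file(page_docs, separator="\n\n"):
    """Group per-page documents by source file and join their text.

    Returns {file_name: full_text}; pages keep the order they arrived in.
    Assumes every document has a "file_name" key in its metadata.
    """
    pages_by_file = defaultdict(list)
    for doc in page_docs:
        pages_by_file[doc.metadata["file_name"]].append(doc.text)
    return {name: separator.join(pages) for name, pages in pages_by_file.items()}


# Stand-in data; in practice this list would be the output of reader.load_data()
pages = [
    PageDoc("Case A, page 1", {"file_name": "case_a.pdf"}),
    PageDoc("Case A, page 2", {"file_name": "case_a.pdf"}),
    PageDoc("Case B, page 1", {"file_name": "case_b.pdf"}),
]
merged = merge_pages_by_file(pages)
```

With llama_index, each merged string could then be wrapped as `Document(text=full_text, metadata={"file_name": name})`, giving one Document per case. Loading one file at a time with `SimpleDirectoryReader(input_files=[path])` and joining the page texts should give the same result.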