
Updated 2 weeks ago

Preserving Page Numbers When Parsing PDF To Text

Hello! I'm using LlamaParse to parse a big PDF file, and I have page numbers added to every page. The resulting file is a txt/md file. If I load this file with the SimpleDirectoryReader, I lose the information about which page the text comes from (it's obviously not part of the metadata anymore, since the file no longer has pages like a PDF does). If I'm lucky, the page number appears in the text of the source node, but that's not reliable enough.
How should I handle this?
4 comments
Parse the JSON result into Document or TextNode objects; there's a ton of useful metadata in there (including page numbers, filenames, etc.)

from llama_parse import LlamaParse

parser = LlamaParse(...)
# returns one JSON result per input file
json_results = parser.get_json_result(["file1.pdf", ...])
I can't remember the exact schema of the JSON result, but print out the first element in the list and check it out.
From what I can tell, the JSON result has all the elements from the original PDF in it, with extra metadata. But how do I now load it into my vector store while preserving the page number?
The JSON gives you the text per page.

Just iterate over it and create your documents/nodes:

doc = Document(text=text, metadata={...})

You'll know the page number and file name to put in the metadata because you are iterating over the pages (I'm also pretty sure the page number is somewhere in the JSON result).
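To make the iteration concrete, here is a rough sketch of it. The schema is an assumption, since nobody in this thread confirmed it: each result is taken to be a dict with a "file_path" key and a "pages" list, and each page a dict with "page" and "text" keys. Print your own result first and adjust the key names. The helper pages_to_records is hypothetical (not part of LlamaParse or LlamaIndex), and it builds plain dicts so the snippet runs without either library installed.

```python
from typing import Any


def pages_to_records(json_results: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Flatten per-file JSON results into one text + metadata record per page.

    Assumed (unverified) schema: {"file_path": str, "pages": [{"page": int, "text": str}]}
    """
    records = []
    for result in json_results:
        file_name = result.get("file_path", "unknown")
        # enumerate gives a fallback page number if the "page" key is missing
        for position, page in enumerate(result.get("pages", []), start=1):
            records.append(
                {
                    "text": page.get("text", ""),
                    "metadata": {
                        "file_name": file_name,
                        "page_number": page.get("page", position),
                    },
                }
            )
    return records


# Made-up sample data in the assumed shape, standing in for a real parser result
sample_results = [
    {
        "file_path": "file1.pdf",
        "pages": [
            {"page": 1, "text": "Introduction"},
            {"page": 2, "text": "Details"},
        ],
    }
]

records = pages_to_records(sample_results)
for record in records:
    print(record["metadata"], record["text"])
```

Each record then maps onto the Document(text=..., metadata={...}) call above, e.g. Document(text=record["text"], metadata=record["metadata"]), so the page number travels with the text into the vector store.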