Preserving Page Numbers When Parsing PDF To Text

At a glance

The community member is using LlamaParse to parse a PDF file, and the resulting text/markdown file no longer carries page number information. They are looking for a way to preserve the page number metadata when loading the file into their vector store. The comments suggest working from the JSON result returned by LlamaParse, which contains the text per page along with metadata such as page numbers and filenames. The recommended approach is to iterate over the JSON result, create Document or TextNode objects, and put the page number and filename in each object's metadata.

kristian

Hello! I'm using LlamaParse to parse a big PDF file, and I have page numbers added to every page. The result is a txt/md file. If I load this file with SimpleDirectoryReader, I lose the information about which page the text comes from (it's obviously no longer part of the metadata, since the file doesn't have pages the way a PDF does). If I'm lucky the page number shows up in the source node, but that's not reliable enough.
How should I handle this?

4 comments

Logan M

Parse the JSON result into Document or TextNode objects. There's a ton of useful metadata in there (including page numbers, filenames, etc.)

from llama_parse import LlamaParse

parser = LlamaParse(...)
json_results = parser.get_json_result(["file1.pdf", ...])

Logan M

I can't remember the exact schema of the JSON result, but print out the first element in the list and check it out.
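Printing the first element directly can dump the full text of every page, so a compact summary of its shape may be easier to read. The sketch below is only an illustration: `summarize` is a hypothetical helper, and `sample` is a hand-made stand-in whose keys are guesses, not the confirmed schema.

```python
def summarize(value, max_depth=3, depth=0):
    """Describe nested dicts/lists by key and type instead of printing full text."""
    if isinstance(value, dict):
        if depth >= max_depth:
            return "dict"
        return {key: summarize(item, max_depth, depth + 1) for key, item in value.items()}
    if isinstance(value, list):
        if not value or depth >= max_depth:
            return f"list[{len(value)}]"
        # Describe only the first item; the others usually share its shape.
        return [summarize(value[0], max_depth, depth + 1)]
    return type(value).__name__


# Hand-made stand-in for json_results[0]; the key names are assumptions.
sample = {"file_path": "file1.pdf", "pages": [{"page": 1, "text": "Intro text"}]}
print(summarize(sample))
```

With a real result, the call would be `print(summarize(json_results[0]))`.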

kristian

From what I can tell, the JSON result has all the elements from the original PDF in it, with extra metadata. But how do I now load it into my vector store while preserving the page number?

Logan M

The JSON gives you the text per page.

Just iterate over it and create your documents/nodes:

doc = Document(text=text, metadata={...})

You'll know the page number and file name to put in the metadata because you are iterating over the pages (also pretty sure the page number is somewhere in the JSON result).
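A minimal sketch of that loop, kept free of library imports so it runs on its own. The key names `"pages"`, `"page"`, `"text"`, `"md"` and `"file_path"` are assumptions about the JSON schema (check them against your own output first), and `pages_to_records` and `example_results` are hypothetical names used only for illustration.

```python
from typing import Any, Dict, List


def pages_to_records(json_results: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten per-file, per-page JSON into one text + metadata record per page."""
    records = []
    for result in json_results:
        file_name = result.get("file_path", "unknown")  # assumed key
        for index, page in enumerate(result.get("pages", []), start=1):
            text = page.get("text") or page.get("md") or ""  # assumed keys
            if not text.strip():
                continue  # skip blank pages
            records.append({
                "text": text,
                "metadata": {
                    # Fall back to the loop position if the page has no number.
                    "page_number": page.get("page", index),
                    "file_name": file_name,
                },
            })
    return records


# Hand-made stand-in for the parser output, shaped per the assumptions above.
example_results = [{
    "file_path": "file1.pdf",
    "pages": [
        {"page": 1, "text": "Intro text"},
        {"page": 2, "text": "   "},
        {"text": "Third page text"},
    ],
}]
records = pages_to_records(example_results)
for record in records:
    print(record["metadata"], record["text"])
```

Each record then maps onto the call shown above, for example `Document(text=record["text"], metadata=record["metadata"])`, and the resulting documents can be indexed as usual so the page number travels with every chunk.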
