
Updated 6 months ago

pdf page metadata

At a glance

The community member is asking how to preserve PDF page numbers when extracting text from a long PDF document, so that search results can include the page number alongside the excerpt. The comments suggest using PdfReader to extract the text page by page and storing each page's text in a Document object, with the page number in its metadata. However, there is no explicitly marked answer in the provided information.

MMitchMcD

This is probably such a simple question, and the answer is probably written somewhere on the Docs page, but I could not find it. How do I preserve a PDF page number for a long PDF, so that when getting vector search (or any other) results, it shows an excerpt + a page number? Thank you

5 comments

bbmax

hello friend

bbmax


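from pypdf import PdfReader  # assumption: pypdf; PyPDF2 exposes the same class name
from llama_index.core import Document  # assumption: import path varies by version

documents = []
with open(path, 'rb') as f:
    pdf = PdfReader(f)
    print("Metadata: ", pdf.metadata)
    # enumerate from 1 so page_number matches the page order in the PDF
    for page_number, page in enumerate(pdf.pages, start=1):
        documents.append(Document(text=page.extract_text(), metadata={"page_number": page_number}))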

MMitchMcD

thanks! so, I read with PdfReader first, and this code will then add page numbers to the chunks of extracted text?

bbmax

ya, you'd have to do something like for key, page in enumerate(pdf.pages): so that you have the page index to put in the metadata

bbmax

or how ever you do a for loop in python 🙂
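
For anyone landing here later, here is a rough sketch of the idea end to end: number the pages while building the documents, then return the page number next to each excerpt. This is only an illustration under some assumptions. The Document class below is a stand-in dataclass (not the real library class), pages_to_documents and search are hypothetical helper names, and the keyword match stands in for a real vector search.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for a document class holding text plus metadata (hypothetical)."""
    text: str
    metadata: dict = field(default_factory=dict)


def pages_to_documents(page_texts):
    """Build one Document per page, numbering pages from 1."""
    return [
        Document(text=text, metadata={"page_number": number})
        for number, text in enumerate(page_texts, start=1)
    ]


def search(documents, query, width=40):
    """Naive keyword search returning (excerpt, page_number) pairs."""
    hits = []
    for doc in documents:
        pos = doc.text.lower().find(query.lower())
        if pos != -1:
            # Keep up to `width` characters of context on each side of the match.
            start = max(0, pos - width)
            excerpt = doc.text[start:pos + len(query) + width]
            hits.append((excerpt, doc.metadata["page_number"]))
    return hits


# Made-up page texts, standing in for page.extract_text() on each page.
page_texts = [
    "Introduction to the report.",
    "Revenue grew in the second quarter.",
    "Appendix and notes.",
]
documents = pages_to_documents(page_texts)
results = search(documents, "revenue")
for excerpt, page_number in results:
    print(f"p.{page_number}: {excerpt}")
```

With a real PDF you would replace page_texts with [page.extract_text() for page in pdf.pages]. Whether the chunks produced by a vector index keep the page_number metadata depends on the library you use, so it is worth checking one result before relying on it.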
