Updated 4 months ago

pdf page metadata

At a glance
The community member is asking how to preserve PDF page numbers when extracting text from a long PDF document, so that search results can include the page number. The comments suggest reading the file with PdfReader, extracting each page's text, and storing the page number as metadata on a Document object. However, there is no explicitly marked answer in the provided information.
This is probably a simple question and the answer is probably written somewhere on the Docs page, but I could not find it. How do I preserve the PDF page number for a long PDF, so that vector search (or any other) results show an excerpt + a page number? Thank you
5 comments
hello friend
Plain Text
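 from pypdf import PdfReader  # assumption: PdfReader comes from pypdf
 from llama_index.core import Document  # assumption: LlamaIndex's Document class
 documents = []
 with open(path, 'rb') as f:  # path: location of your PDF file
     pdf = PdfReader(f)
     print("Metadata: ", pdf.metadata)
     # enumerate from 1 so each page carries its own number
     for page_number, page in enumerate(pdf.pages, start=1):
         documents.append(Document(text=page.extract_text(), metadata={"page_number": page_number}))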
thanks! so, I read with PdfReader first, and this code will then add page numbers to the chunks of extracted text?
ya, you'd have to do something like for page_number, page in enumerate(pdf.pages, start=1):
or however you do a for loop in Python 🙂
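On the follow-up about chunks: the loop in the snippet gives you one entry per page, so the page number only reaches the search results if each chunk cut from a page keeps a copy of it. Here is a minimal sketch of that idea in plain Python, with no PDF or indexing library involved. Chunk, chunk_pages, and search are hypothetical names, the keyword match is only a stand-in for vector search, and the page texts are made-up placeholders for what page.extract_text() would return.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A piece of page text plus the page it came from (hypothetical type)."""
    text: str
    page_number: int


def chunk_pages(page_texts, chunk_size=200):
    """Split each page into fixed-size chunks, copying the page number onto each."""
    chunks = []
    # Count pages from 1 rather than 0.
    for page_number, text in enumerate(page_texts, start=1):
        for start in range(0, len(text), chunk_size):
            piece = text[start:start + chunk_size].strip()
            if piece:  # skip empty or whitespace-only pieces
                chunks.append(Chunk(text=piece, page_number=page_number))
    return chunks


def search(chunks, query):
    """Naive keyword match standing in for vector search; returns (excerpt, page)."""
    query = query.lower()
    return [(c.text, c.page_number) for c in chunks if query in c.text.lower()]


# Made-up page texts standing in for page.extract_text() output.
pages = ["Intro to the report.", "Revenue grew in Q3.", "Appendix and notes."]
chunks = chunk_pages(pages, chunk_size=50)
results = search(chunks, "revenue")
for excerpt, page_number in results:
    print(f"p. {page_number}: {excerpt}")
```

Because the page number travels with every chunk, a long page that gets split into several chunks still reports the same page for each of them.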