The community member is asking how to preserve PDF page numbers when extracting text from a long PDF document, so that search results can include the page number alongside each excerpt. The comments suggest using PdfReader to extract each page's text and the PDF metadata, and storing the page number in the metadata of a Document object. However, there is no explicitly marked answer in the provided information.
This is probably a simple question, and the answer is probably written somewhere on the Docs page, but I could not find it. How do I preserve the PDF page number for a long PDF, so that vector search (or any other) results show an excerpt plus a page number? Thank you.
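from pypdf import PdfReader  # assumed source of PdfReader; older installs use PyPDF2
from llama_index.core import Document  # assumed LlamaIndex; adjust the import path to your version

documents = []
with open(path, 'rb') as f:
    pdf = PdfReader(f)
    print("Metadata: ", pdf.metadata)
    # Number pages from 1 so page_number matches what a PDF viewer shows
    for page_number, page in enumerate(pdf.pages, start=1):
        documents.append(Document(text=page.extract_text(), metadata={"page_number": page_number}))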
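To show how the stored page number can come back with a search hit, here is a minimal sketch that uses only the standard library. `PageDoc`, `build_page_docs`, and `search` are hypothetical names made up for illustration, and the naive substring match is only a stand-in for a real vector search. The point is simply that whatever metadata is attached to each page's document is available again when that document is returned as a result.

```python
from dataclasses import dataclass, field


@dataclass
class PageDoc:
    """Hypothetical stand-in for a Document: page text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def build_page_docs(page_texts):
    """Wrap each page's text in a PageDoc, numbering pages from 1."""
    return [
        PageDoc(text=text, metadata={"page_number": number})
        for number, text in enumerate(page_texts, start=1)
    ]


def search(docs, query, width=40):
    """Naive case-insensitive substring search (not a vector search).

    Returns a list of (page_number, excerpt) tuples for matching pages.
    """
    results = []
    needle = query.lower()
    for doc in docs:
        pos = doc.text.lower().find(needle)
        if pos == -1:
            continue
        # Cut an excerpt of up to `width` characters either side of the match
        start = max(0, pos - width)
        end = min(len(doc.text), pos + len(query) + width)
        results.append((doc.metadata["page_number"], doc.text[start:end]))
    return results


# Made-up page texts standing in for the output of page.extract_text()
pages = [
    "Introduction to the report.",
    "Revenue grew in the second quarter.",
    "Appendix and references.",
]
docs = build_page_docs(pages)
for page_number, excerpt in search(docs, "revenue"):
    print(f"p.{page_number}: {excerpt}")
```

With a real index the same idea should apply: frameworks such as LlamaIndex generally carry a Document's metadata over to the chunks derived from it, so the page number can be read from the metadata of each retrieved result. Check the docs for your version to confirm the exact attribute names.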