Thanks Logan! Right now I'm doing it like this:
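import io

import pdfplumber
from llama_index import Document, GPTSimpleVectorIndex

# pdf (raw bytes), file_info, clean_text and service_context are defined elsewhere
documents = []
with pdfplumber.open(io.BytesIO(pdf)) as pdf_file:
    pages = pdf_file.pages
    for i, page in enumerate(pages):
        # guard in case extract_text() returns None for a page without text
        text = page.extract_text() or ""
        text = clean_text(text)
        # one Document per page, tagged with file-level metadata
        documents.append(
            Document(
                text=text,
                doc_id=f"doc_{i}",
                extra_info={
                    "document": file_info["company_name"],
                    "company": file_info["company_name"],
                    "type": file_info["document_type"],
                    "year": file_info["document_year"],
                    "page": i,
                },
            )
        )
index = GPTSimpleVectorIndex.from_documents(documents=documents, service_context=service_context)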
So I'm splitting a PDF into a list of Documents. As I understand it, this is the same as splitting the PDF into a list of Nodes and creating an index like GPTSimpleVectorIndex(nodes), right?
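
In case it helps to see the per-page structure on its own, here is a minimal sketch of what goes into each Document, with plain dicts standing in for Document so it runs without pdfplumber or llama_index. The helper name build_page_records and the sample values are made up for illustration, and clean_text is left out.

```python
from typing import Dict, List, Optional


def build_page_records(
    page_texts: List[Optional[str]], file_info: Dict[str, str]
) -> List[dict]:
    """Build one record per PDF page, mirroring the Document fields above."""
    records = []
    for i, text in enumerate(page_texts):
        records.append(
            {
                # Assumption: a page without extractable text may come back
                # as None, so fall back to an empty string.
                "text": text or "",
                "doc_id": f"doc_{i}",
                "extra_info": {
                    "document": file_info["company_name"],
                    "company": file_info["company_name"],
                    "type": file_info["document_type"],
                    "year": file_info["document_year"],
                    "page": i,
                },
            }
        )
    return records


# Hypothetical sample input: two pages, the second with no extractable text.
sample_info = {
    "company_name": "Acme",
    "document_type": "annual_report",
    "document_year": "2022",
}
records = build_page_records(["First page text", None], sample_info)
```

Each record carries the same file-level metadata plus its own page index, so every page can be traced back to its source file after indexing.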