Hello. I'm using the DocumentSummaryIndex class to summarise PDF documents. I load the PDFs with SimpleDirectoryReader, which parses each page as a separate doc/node (not sure which is the right term). As a result, I get a summary for each page. This is OK, but what I'm really looking for is a summary of the overall PDF. I've tried extracting the summary nodes and passing them to the index's build_index_from_nodes method, but the result is the same. Any suggestions? A different approach would be fine too.
I'm not sure how to do that. I could create a new Document object and assign these Documents as child_nodes, but that doesn't seem very idiomatic. Any idea?
combined_documents = {}
for document in documents:
    file_name = document.metadata['file_name']
    if file_name not in combined_documents:
        combined_documents[file_name] = document
    else:
        combined_documents[file_name].text += "\n" + document.text
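In case it helps, here is a minimal self-contained sketch of the same grouping idea that builds a new object per file instead of appending to the first page in place. `Page` and `merge_pages_by_file` are hypothetical stand-ins so the snippet runs without llama_index; with real Documents you would read the metadata and text from them instead. It also drops `page_label`, on the assumption that a per-page label is not meaningful on a merged document.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for one per-page Document from the reader."""
    text: str
    metadata: dict = field(default_factory=dict)


def merge_pages_by_file(pages: list[Page]) -> list[Page]:
    """Return one merged Page per file_name, keeping the input page order."""
    grouped: dict[str, list[Page]] = {}
    for page in pages:
        grouped.setdefault(page.metadata["file_name"], []).append(page)

    merged = []
    for file_pages in grouped.values():
        # Copy the first page's metadata and drop the per-page label
        # (assumption: it should not survive on the merged document).
        metadata = dict(file_pages[0].metadata)
        metadata.pop("page_label", None)
        text = "\n".join(p.text for p in file_pages)
        merged.append(Page(text=text, metadata=metadata))
    return merged


pages = [
    Page("intro", {"file_name": "a.pdf", "page_label": "1"}),
    Page("details", {"file_name": "a.pdf", "page_label": "2"}),
    Page("other", {"file_name": "b.pdf", "page_label": "1"}),
]
merged = merge_pages_by_file(pages)
```

Because the input pages are left untouched, you can still build per-page summaries from them afterwards if you want both levels.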
# Imports assume the llama_index >= 0.10 package layout; older releases
# import the same names from `llama_index` directly.
from copy import deepcopy
from pathlib import Path

from llama_index.core import Document, DocumentSummaryIndex
from llama_index.core.schema import BaseNode


def summarise_top_document(document_path: Path, index: DocumentSummaryIndex) -> None:
    """
    Create a top PDF summary.

    Starting from a document path, extract the document node summaries and
    merge them into a single document. The metadata of the first node is
    used as reference for the top document. The resulting summary node is
    inserted into the index. Note that summary nodes are labelled as such
    via their metadata.

    :param document_path:
    :param index:
    """
    # _extract_document_node_summaries and process_document_object are
    # helpers of mine that are not shown here.
    nodes: list[BaseNode] = _extract_document_node_summaries(document_path, index)
    merged_text: str = "\n\n\n".join([n.get_content() for n in nodes])

    # Use the first page's node as the template for metadata and excluded keys.
    ref_node: BaseNode = deepcopy(nodes[0])
    top_doc: BaseNode = deepcopy(ref_node)
    # The page label is per page, so it is dropped from the top document.
    top_doc.metadata.pop("page_label", None)
    top_doc.metadata["summary"] = True
    top_doc.set_content(merged_text)
    top_doc.excluded_embed_metadata_keys = ref_node.excluded_embed_metadata_keys
    top_doc.excluded_llm_metadata_keys = ref_node.excluded_llm_metadata_keys

    d = Document(metadata=top_doc.metadata,  # type: ignore
                 excluded_embed_metadata_keys=ref_node.excluded_embed_metadata_keys,
                 excluded_llm_metadata_keys=ref_node.excluded_llm_metadata_keys)
    d.set_content(merged_text)
    process_document_object(d, filter="summary")
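The core of the function above is a two-level summary: join the per-page summaries, then summarise the joined text once more. Below is a rough library-free sketch of just that step. `summarise_whole_document` and `first_sentences` are hypothetical names, and the `summarise` callable is a placeholder for whatever actually produces the summary (an LLM call in a real setup), not a llama_index API.

```python
from typing import Callable


def summarise_whole_document(
    page_summaries: list[str],
    summarise: Callable[[str], str],
    separator: str = "\n\n\n",
) -> str:
    """Merge per-page summaries and summarise the merged text once more."""
    if not page_summaries:
        raise ValueError("no page summaries to merge")
    merged_text = separator.join(page_summaries)
    return summarise(merged_text)


# Toy summariser for demonstration only: keeps the first sentence of each
# page summary. A real setup would call an LLM here instead.
def first_sentences(text: str) -> str:
    chunks = [c for c in text.split("\n\n\n") if c]
    return " ".join(c.split(". ")[0].rstrip(".") + "." for c in chunks)


top = summarise_whole_document(
    ["Page one covers setup. It lists tools.", "Page two covers usage."],
    first_sentences,
)
```

Passing the summariser in as a callable keeps the merge step testable without a model, and the empty-list check avoids the IndexError you would otherwise hit on `nodes[0]` when a file has no page summaries.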