Hello. I'm using the

Hello. I'm using the DocumentSummaryIndex class to summarise PDF documents. I load the PDFs with the SimpleDirectoryReader. The reader parses each page as a doc/node (not sure of the right term). As a result, I get a summary for each page. This is ok, but what I'm really looking for is a summary of the overall PDF. I've tried extracting the summary nodes and using the build_index_from_nodes method on the index, but it does the same. Any suggestions? Including a different approach 🙂

7 comments

Logan M

Combine the output of SimpleDirectoryReader so that you have one Document object per PDF?

Mebster

You mean merging Document objects?

Logan M

yea, since the loader is splitting into a Document object per page

Mebster

I'm not sure how to do that. I could create a new Document object and assign these Documents as child_nodes, but that doesn't seem very idiomatic. Any idea?

Logan M

something like

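# Group the per-page Document objects by source file.
combined_documents = {}
for document in documents:
    file_name = document.metadata['file_name']
    if file_name not in combined_documents:
        # First page of this file: keep its Document as the base.
        combined_documents[file_name] = document
    else:
        # Later pages: append their text to the base Document.
        combined_documents[file_name].text += "\n" + document.text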
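
A self-contained sketch of the same grouping idea, for anyone who wants to try it without llama_index installed. `Page` is a hypothetical stand-in for a per-page Document (only `text` and `metadata` are modelled), and `merge_pages_by_file` is an illustrative name, not a library function. Unlike the snippet above, it builds new objects instead of mutating the first page in place, and it drops the page-specific `page_label` key.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a per-page Document (text plus metadata)."""
    text: str
    metadata: dict = field(default_factory=dict)


def merge_pages_by_file(pages: list[Page]) -> list[Page]:
    """Return one Page per file_name, with page texts joined in input order."""
    grouped: dict[str, list[Page]] = {}
    for page in pages:
        grouped.setdefault(page.metadata["file_name"], []).append(page)

    merged = []
    for group in grouped.values():
        # Keep the first page's metadata, minus the page-specific label.
        metadata = {k: v for k, v in group[0].metadata.items() if k != "page_label"}
        merged.append(Page(text="\n".join(p.text for p in group), metadata=metadata))
    return merged


pages = [
    Page("Intro", {"file_name": "a.pdf", "page_label": "1"}),
    Page("Body", {"file_name": "a.pdf", "page_label": "2"}),
    Page("Other", {"file_name": "b.pdf", "page_label": "1"}),
]
merged = merge_pages_by_file(pages)
```

With real Document objects, the merged list would presumably be what gets passed to the summary index, so that each PDF yields a single summary.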

Mebster

Thanks. Here's what I've got:

from copy import deepcopy
from pathlib import Path

# llama_index imports depend on the installed version, e.g.:
# from llama_index.core import Document, DocumentSummaryIndex
# from llama_index.core.schema import BaseNode


def summarise_top_document(document_path: Path, index: DocumentSummaryIndex) -> None:
    """
    Create a top-level summary for a PDF. Starting from a document path, extract the document node
    summaries and merge them into a single document. The metadata of the first node is used as the
    reference for the top-level document. The resulting summary node is inserted into the index.
    Note that summary nodes are labelled as such via their metadata.
    :param document_path: path of the PDF to summarise
    :param index: index holding the per-page summaries
    """
    # _extract_document_node_summaries and process_document_object are my own helpers (not shown).
    nodes: list[BaseNode] = _extract_document_node_summaries(document_path, index)
    merged_text: str = "\n\n\n".join([n.get_content() for n in nodes])
    ref_node: BaseNode = deepcopy(nodes[0])
    top_doc: BaseNode = deepcopy(ref_node)
    # The page label is page-specific, so drop it; flag the node as a summary instead.
    del top_doc.metadata["page_label"]
    top_doc.metadata["summary"] = True
    top_doc.set_content(merged_text)
    top_doc.excluded_embed_metadata_keys = ref_node.excluded_embed_metadata_keys
    top_doc.excluded_llm_metadata_keys = ref_node.excluded_llm_metadata_keys
    d = Document(metadata=top_doc.metadata,  # type: ignore
                 excluded_embed_metadata_keys=ref_node.excluded_embed_metadata_keys,
                 excluded_llm_metadata_keys=ref_node.excluded_llm_metadata_keys)
    d.set_content(merged_text)
    process_document_object(d, filter="summary")

Mebster

I wanted to make sure the metadata was coherent, so that the retriever would work correctly. It does!! Happy days 😛
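
For reference, the metadata handling in the function above (copy the first node's metadata, drop the page label, flag the result as a summary) can be sketched with plain dicts. `build_top_summary` and the `(text, metadata)` tuple shape are hypothetical, chosen only so the example runs without llama_index. The empty-input check is an addition that the original function does not have.

```python
from copy import deepcopy


def build_top_summary(page_summaries: list[tuple[str, dict]]) -> tuple[str, dict]:
    """Merge per-page (summary_text, metadata) pairs into one (text, metadata) pair."""
    if not page_summaries:
        raise ValueError("need at least one page summary")
    merged_text = "\n\n\n".join(text for text, _ in page_summaries)
    # Copy so the first page's metadata is left untouched.
    metadata = deepcopy(page_summaries[0][1])
    metadata.pop("page_label", None)  # page-specific, so not meaningful for the whole PDF
    metadata["summary"] = True  # label used later to filter summary nodes
    return merged_text, metadata


first_meta = {"file_name": "a.pdf", "page_label": "1"}
text, meta = build_top_summary([
    ("Summary of page 1", first_meta),
    ("Summary of page 2", {"file_name": "a.pdf", "page_label": "2"}),
])
```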
