Hey @Logan M,

I want to parse a PDF document using LlamaParse followed by a MarkdownElementNodeParser. But then I can't use this transformation in a DocumentSummaryIndex, because it throws a ValueError("ref_doc_id of node cannot be None when building a document summary index").

This is because the IndexNodes produced by MarkdownElementNodeParser do not have a ref_doc_id, and, more importantly, the ref_doc_id differs across TextNodes even when they belong to the same source document. As a result, the DocumentSummaryIndex does not produce an overall summary.

Here is my code:

Plain Text
parser = LlamaParse(
    result_type="markdown",
    num_workers=4,
    verbose=True,
    language="en",
)
file_extractor = {".pdf": parser}

documents = SimpleDirectoryReader(
    input_dir="./pdf_documents/", file_extractor=file_extractor
).load_data()

def document_splitter(documents, **kwargs):
    node_parser = MarkdownElementNodeParser(
        llm=OpenAI(model=OPENAI_MODEL_NAME), num_workers=8
    )

    nodes = node_parser.get_nodes_from_documents(documents=documents)
    base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
    return base_nodes + objects

DocumentSummaryIndex.from_documents(
    documents,
    transformations=[
        document_splitter,
    ],
)


Do you see any reason why? Can you think of a workaround?
The way that the markdown element node parser works, I don't think it makes sense to combine it with a document summary index 🤔 Although, you could process documents one at a time and manually set the parent doc:

Plain Text
from llama_index.core.schema import NodeRelationship, RelatedNodeInfo

def document_splitter(documents, **kwargs):
    node_parser = MarkdownElementNodeParser(
        llm=OpenAI(model=OPENAI_MODEL_NAME), num_workers=8
    )

    all_nodes = []
    for document in documents:
        # Parse one document at a time so its nodes can be tied back to it
        nodes = node_parser.get_nodes_from_documents(documents=[document])
        base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

        for node in base_nodes:
            node.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(
                node_id=document.id_,
            )

        all_nodes.extend(base_nodes)
        all_nodes.extend(objects)
    return all_nodes
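For context, in llama_index a node's ref_doc_id is derived from its SOURCE relationship, which is why nodes that lack one trigger the ValueError. The sketch below uses simplified stand-in classes (Node, RelatedNodeInfo, and NodeRelationship are re-implemented here for illustration, not the real llama_index schema) to show the mechanism:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class NodeRelationship(Enum):
    SOURCE = "source"
    PARENT = "parent"


@dataclass
class RelatedNodeInfo:
    node_id: str


@dataclass
class Node:
    text: str
    relationships: Dict[NodeRelationship, RelatedNodeInfo] = field(
        default_factory=dict
    )

    @property
    def ref_doc_id(self) -> Optional[str]:
        # ref_doc_id is looked up from the SOURCE relationship; without
        # one it is None, which DocumentSummaryIndex rejects.
        source = self.relationships.get(NodeRelationship.SOURCE)
        return source.node_id if source is not None else None


# A node with no SOURCE relationship has ref_doc_id None.
orphan = Node(text="table summary")
assert orphan.ref_doc_id is None

# Attaching the source document's id restores the lookup.
orphan.relationships[NodeRelationship.SOURCE] = RelatedNodeInfo(node_id="doc-123")
assert orphan.ref_doc_id == "doc-123"
```

So besides setting PARENT as above, setting the SOURCE relationship to the originating document in the same loop may be what DocumentSummaryIndex actually needs.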
Why do you think it does not make sense to combine a markdown element node parser with a document summary index? I feel they address two different concerns: the markdown element node parser gracefully splits the result of LlamaParse and correctly handles tables, while the document summary index generates a summary of the whole document.
Thanks for the answer anyway!
Don't you think it's a bug that the markdown element node parser creates unstable ref_doc_ids? If so, I may open an issue and propose a PR.
You can definitely open a PR yea