Hey @Logan M,

I want to parse a PDF document using LlamaParse followed by a MarkdownElementNodeParser. But then I can't use this transformation in a DocumentSummaryIndex, because it throws a ValueError("ref_doc_id of node cannot be None when building a document summary index").

This is because the IndexNodes produced by MarkdownElementNodeParser do not have a ref_doc_id, and, more importantly, the ref_doc_id differs across TextNodes even when they belong to the same source document. As a result, the DocumentSummaryIndex does not produce an overall summary.

Here is my code:

Plain Text
parser = LlamaParse(
    result_type="markdown",
    num_workers=4,
    verbose=True,
    language="en",
)
file_extractor = {".pdf": parser}

documents = SimpleDirectoryReader(
    input_dir="./pdf_documents/", file_extractor=file_extractor
).load_data()

def document_splitter(documents, **kwargs):
    node_parser = MarkdownElementNodeParser(
        llm=OpenAI(model=OPENAI_MODEL_NAME), num_workers=8
    )

    nodes = node_parser.get_nodes_from_documents(documents=documents)
    base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
    return base_nodes + objects

DocumentSummaryIndex.from_documents(
    documents,
    transformations=[
        document_splitter,
    ],
)


Do you see any reason why? Can you think of a workaround?
The way that the markdown element node parser works, I don't think it makes sense to combine it with a document summary index 🤔 Although, you could process documents one at a time and manually set the parent doc:

Plain Text
from llama_index.core.schema import NodeRelationship, RelatedNodeInfo

def document_splitter(documents, **kwargs):
    node_parser = MarkdownElementNodeParser(
        llm=OpenAI(model=OPENAI_MODEL_NAME), num_workers=8
    )

    all_nodes = []
    for document in documents:
        # Parse one document at a time so its nodes can be tied back to it
        nodes = node_parser.get_nodes_from_documents(documents=[document])
        base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

        for node in base_nodes:
            node.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(
                node_id=document.id_,
            )

        all_nodes.extend(base_nodes)
        all_nodes.extend(objects)
    return all_nodes
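For context, in llama_index a node's ref_doc_id is derived from its SOURCE relationship, which is why nodes that lack one trigger the ValueError. The sketch below uses simplified stand-in classes (Node, RelatedNodeInfo, and NodeRelationship are re-implemented here for illustration, not the real llama_index schema) to show the mechanism:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class NodeRelationship(Enum):
    SOURCE = "source"
    PARENT = "parent"


@dataclass
class RelatedNodeInfo:
    node_id: str


@dataclass
class Node:
    text: str
    relationships: Dict[NodeRelationship, RelatedNodeInfo] = field(
        default_factory=dict
    )

    @property
    def ref_doc_id(self) -> Optional[str]:
        # ref_doc_id is looked up from the SOURCE relationship; without
        # one it is None, which DocumentSummaryIndex rejects.
        source = self.relationships.get(NodeRelationship.SOURCE)
        return source.node_id if source is not None else None


# A node with no SOURCE relationship has ref_doc_id None.
orphan = Node(text="table summary")
assert orphan.ref_doc_id is None

# Attaching the source document's id restores the lookup.
orphan.relationships[NodeRelationship.SOURCE] = RelatedNodeInfo(node_id="doc-123")
assert orphan.ref_doc_id == "doc-123"
```

So besides setting PARENT as above, setting the SOURCE relationship to the originating document in the same loop may be what DocumentSummaryIndex actually needs.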
Why do you think it does not make sense to combine a markdown element node parser with a document summary index? I feel they address two different concerns: the markdown element node parser gracefully splits the result of LlamaParse and correctly handles tables, while the document summary index generates a summary of the whole document.
Thanks for the answer anyway!
Don't you think it's a bug that the markdown element node parser creates unstable ref_doc_ids? If so, I may open an issue and propose a PR.
You can definitely open a PR yea