Anatole
Joined September 25, 2024
Hey @Logan M,
I want to parse a PDF document with LlamaParse and then split it with a MarkdownElementNodeParser. However, I can't use this transformation in a DocumentSummaryIndex, because it raises ValueError("ref_doc_id of node cannot be None when building a document summary index").

This is because the IndexNodes produced by MarkdownElementNodeParser do not have a ref_doc_id. More importantly, the ref_doc_id is different for every TextNode, even when the nodes belong to the same source document. As a result, the DocumentSummaryIndex does not produce an overall summary.
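
To illustrate what I mean, here is a minimal stand-alone sketch in plain Python, with no LlamaIndex involved. The Node class and the group_by_ref_doc_id helper are hypothetical stand-ins, not library types; they only mimic what I understand the index to do, which is to group nodes by ref_doc_id before summarizing each group.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a parsed node; not a LlamaIndex class."""

    text: str
    ref_doc_id: Optional[str]  # id of the parent this node points back to


def group_by_ref_doc_id(nodes):
    """Group node texts by ref_doc_id, rejecting nodes that have none."""
    groups = defaultdict(list)
    for node in nodes:
        if node.ref_doc_id is None:
            raise ValueError("ref_doc_id of node cannot be None")
        groups[node.ref_doc_id].append(node.text)
    return dict(groups)


# Three chunks of one PDF, each pointing at a different parent id.
scattered = [Node("intro", "a"), Node("table", "b"), Node("outro", "c")]

# The same chunks after re-pointing them at a single document id.
unified = [Node(n.text, "report.pdf") for n in scattered]

print(len(group_by_ref_doc_id(scattered)))  # one group per chunk
print(len(group_by_ref_doc_id(unified)))    # a single group for the document
```

With one group per chunk there is nothing to merge into a document-level summary, which matches what I observe.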

Here is my code:

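from llama_index.core import DocumentSummaryIndex, SimpleDirectoryReader
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI
from llama_parse import LlamaParse

# OPENAI_MODEL_NAME is assumed to be defined elsewhere (not shown here).

# Parse every PDF in the directory to markdown with LlamaParse.
parser = LlamaParse(
    result_type="markdown",
    num_workers=4,
    verbose=True,
    language="en",
)
file_extractor = {".pdf": parser}

documents = SimpleDirectoryReader(
    input_dir="./pdf_documents/", file_extractor=file_extractor
).load_data()


def document_splitter(documents, **kwargs):
    # Split the markdown into text nodes plus table/index objects.
    node_parser = MarkdownElementNodeParser(
        llm=OpenAI(model=OPENAI_MODEL_NAME), num_workers=8
    )

    nodes = node_parser.get_nodes_from_documents(documents=documents)
    base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
    return base_nodes + objects


# This call raises the ValueError described above.
DocumentSummaryIndex.from_documents(
    documents,
    transformations=[
        document_splitter,
    ],
)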


Do you see any reason why this happens? Can you think of a workaround?