Hey @Logan M,
I want to parse a PDF document using LlamaParse followed by a MarkdownElementNodeParser. However, I can't use this transformation in a DocumentSummaryIndex because it raises:
ValueError("ref_doc_id of node cannot be None when building a document summary index")
This is because the IndexNodes produced by MarkdownElementNodeParser do not have a ref_doc_id. More importantly, the ref_doc_id that the TextNodes do carry is different for every TextNode, even when they belong to the same source document. As a result, the DocumentSummaryIndex does not produce an overall summary.
Here is my code:
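# Imports assume the llama-index 0.10+ package layout.
from llama_index.core import DocumentSummaryIndex, SimpleDirectoryReader
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI
from llama_parse import LlamaParse

# Parse PDFs into markdown with LlamaParse.
parser = LlamaParse(
    result_type="markdown",
    num_workers=4,
    verbose=True,
    language="en",
)
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    input_dir="./pdf_documents/", file_extractor=file_extractor
).load_data()


def document_splitter(documents, **kwargs):
    # OPENAI_MODEL_NAME is defined elsewhere in my code.
    node_parser = MarkdownElementNodeParser(
        llm=OpenAI(model=OPENAI_MODEL_NAME), num_workers=8
    )
    nodes = node_parser.get_nodes_from_documents(documents=documents)
    # Split the parsed nodes into TextNodes and IndexNodes (tables).
    base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
    return base_nodes + objects


# This call raises the ValueError above.
DocumentSummaryIndex.from_documents(
    documents,
    transformations=[
        document_splitter,
    ],
)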
Do you see any reason why this happens? Can you think of a workaround?
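
In case it helps to reproduce the ref_doc_id observation above, here is a minimal sketch of how the nodes could be grouped by their ref_doc_id. The helper name group_by_ref_doc_id and the sample objects are hypothetical stand-ins, not part of LlamaIndex; the function only assumes each node exposes node_id and ref_doc_id attributes, which as far as I can tell the real nodes do.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_ref_doc_id(nodes):
    """Map each ref_doc_id (possibly None) to the ids of the nodes carrying it."""
    groups = defaultdict(list)
    for node in nodes:
        # getattr keeps this tolerant of objects with no ref_doc_id attribute.
        groups[getattr(node, "ref_doc_id", None)].append(node.node_id)
    return dict(groups)


# Hypothetical stand-ins for the parser output, not real LlamaIndex nodes.
sample_nodes = [
    SimpleNamespace(node_id="text-1", ref_doc_id="doc-a"),
    SimpleNamespace(node_id="text-2", ref_doc_id="doc-b"),
    SimpleNamespace(node_id="index-1", ref_doc_id=None),
]

groups = group_by_ref_doc_id(sample_nodes)
missing = groups.get(None, [])
print(f"{len(groups)} distinct ref_doc_id values, {len(missing)} node(s) without one")
```

Passing the list returned by document_splitter instead of sample_nodes should show whether nodes from one PDF really end up under different keys, and which nodes fall under None.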