Constructing a Document from a List of TextNode

At a glance

The community member has a Markdown document that they split into a list of TextNode objects using MarkdownNodeParser. They previously filtered the nodes on node.metadata['Header_1'], but after updating the llama-index-core module this metadata key is no longer present. They now add the header information back manually, but are unsure how to convert the updated list of TextNode objects back into a Document object.

In the comments, another community member suggests that metadata can be added directly to the node and gives an example. A second community member mentions that the new code adds a header_path metadata key, which can be used instead.

The original community member then asks whether multiple TextNode objects can be combined into a single Document object, since this was not an issue before the update. Another community member replies that there is probably no direct method for this, but that one can iterate over the nodes, stitch the text together, copy the metadata, and create a final Document.

ggalvangjx

hello there, is there a way to construct a Document from a list of TextNode?

I have a Markdown document from LlamaParse that I break down into a list of nodes using MarkdownNodeParser. I was using node.metadata['Header_1'] to filter those nodes by the Markdown headers in my document and then amend their text.

Now that I have updated llama-index-core, node.metadata dictionary is missing the Header_1. What I do now is manually add them back, but I'm stuck with a list of updated TextNode, not knowing how to convert them into a Document.

9 comments

WhiteFang_Jr

Hey, you can add metadata directly to the Node itself, but I'm not quite sure what you mean by this: "Now that I have updated llama-index-core, node.metadata dictionary is missing the Header_1"

But to give you an idea how you can add metadata to a node:

from llama_index.core.schema import TextNode
node1 = TextNode(text="<text_chunk>", id_="<node_id>")
node1.metadata['Header_1'] = 'ADD_HEADER'  # placeholder: put the real header text here

SSaltuk

Ah okay i see

SSaltuk

The new code adds the metadata key header_path, so you can try accessing that instead, I suppose. If a section has no header, the value is just /; otherwise it is an actual path, for example /1. Introduction/1.1 Subsection.
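A minimal sketch of filtering nodes on header_path instead of Header_1, assuming the path format described in the reply above. The StubNode class and the filter_by_top_header helper are hypothetical names for illustration only (StubNode just stands in for TextNode so the snippet runs without llama-index installed). It is worth printing a few real header_path values first, since the exact contents (for instance, whether a section's own header is part of its path) may differ between llama-index-core versions.

```python
from dataclasses import dataclass, field


@dataclass
class StubNode:
    """Hypothetical stand-in for TextNode: only .text and .metadata are used."""
    text: str
    metadata: dict = field(default_factory=dict)


def filter_by_top_header(nodes, top_header):
    """Return the nodes whose header_path begins with the given top-level header."""
    selected = []
    for node in nodes:
        # Assumption: missing header_path is treated like the "no header" value "/".
        path = node.metadata.get("header_path", "/")
        # "/1. Introduction/1.1 Subsection" -> ["1. Introduction", "1.1 Subsection"]
        parts = [part for part in path.split("/") if part]
        if parts and parts[0] == top_header:
            selected.append(node)
    return selected


nodes = [
    StubNode("intro text", {"header_path": "/1. Introduction"}),
    StubNode("sub text", {"header_path": "/1. Introduction/1.1 Subsection"}),
    StubNode("preamble", {"header_path": "/"}),
]
intro_nodes = filter_by_top_header(nodes, "1. Introduction")
```

Because the path is split on "/", a header whose own text contains a slash would be split into several parts, so this approach only suits documents whose headers avoid that character.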

ggalvangjx

Yeah, that is what I do. My question was: if I have multiple TextNode objects, is there a way to combine them into a Document object?

Before I updated the llama-index-core module, this wasn't an issue.

ggalvangjx

With a Document object, you can do

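from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser

document = Document(text="# Title\n\nSome text")  # any Document instance
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents([document])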

And nodes will be a list of TextNode. Is there a way to combine them back into a Document object? Like an inverse transform operation.

ggalvangjx

Hope my question is clear, thanks!

WhiteFang_Jr

I don't think there is a method for the inverse transform, but you can iterate over the nodes, stitch the text together, copy the metadata and create a final Document.

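from llama_index.core import Document

text = ''
metadata = {}  # Document metadata is a dict, not a list
for node in nodes:
  text = text + node.text
  # merge this node's metadata; later nodes overwrite keys they share with earlier ones
  metadata.update(node.metadata)

# now form the document object using the text and metadata
doc = Document(text=text, metadata=metadata)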
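One thing to decide when copying metadata is what to do with keys that differ from node to node, such as header_path. A minimal sketch of one option: join the texts with a separator and keep only the metadata every node agrees on. The stitch_nodes helper, the blank-line separator and the file_name key are assumptions for illustration, and SimpleNamespace objects stand in for TextNode so the snippet runs without llama-index installed.

```python
from types import SimpleNamespace


def stitch_nodes(nodes, separator="\n\n"):
    """Join node texts and keep only the metadata entries shared by every node."""
    text = separator.join(node.text for node in nodes)
    # Start from the first node's metadata, then drop any key that another
    # node is missing or holds with a different value.
    shared = dict(nodes[0].metadata) if nodes else {}
    for node in nodes[1:]:
        shared = {
            key: value
            for key, value in shared.items()
            if key in node.metadata and node.metadata[key] == value
        }
    return text, shared


# Hypothetical stand-ins for TextNode objects.
nodes = [
    SimpleNamespace(text="# Intro\nHello",
                    metadata={"file_name": "report.md", "header_path": "/"}),
    SimpleNamespace(text="# Usage\nWorld",
                    metadata={"file_name": "report.md", "header_path": "/Intro"}),
]
text, metadata = stitch_nodes(nodes)
```

With real nodes, the two return values can then be passed on as Document(text=text, metadata=metadata). Dropping the per-node keys loses the header information, so if that is still needed after recombining, it has to be kept somewhere else (or recovered by parsing the Document again).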

ggalvangjx

Ah, this is what I was looking for. I didn't know metadata could be added to a Document in the same way as to a TextNode.

ggalvangjx

I will give it a try, thanks @WhiteFang_Jr !
