Updated 5 months ago

Hello Llamaindex,

I've developed a custom MetadataAwareTextSplitter, for precise control over the content segmentation into chunks. This splitter processes JSON data, extracting various elements like title, description, component status, search tags, and different section types. I've successfully saved the generated nodes in a Postgres-based Vector DB.

However, I'm facing a challenge with metadata association for each node. Currently, it seems that the nodes only inherit metadata from their parent document instance. Is there a way to assign unique metadata to each node, reflecting their specific content and characteristics, rather than just inheriting from the parent document?
13 comments
What version of llama-index are you on?

In version v0.9, text splitters actually are subclasses of node parsers now. This has a few implications that I think will be helpful for you

If you look at the base NodeParser class, get_nodes_from_documents() is actually calling node.metadata.update(parent_metadata) -- you can enable or disable this, or even change the default in your subclass.

Furthermore, since it uses update(), your text splitter can also assign metadata to nodes as it processes them, and it will propagate to the output just fine
Your exact use-case is part of why we made this change in the first place
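For illustration, here is a minimal plain-Python sketch (no llama-index types involved) of the dict.update() merge described above: keys the splitter assigns to a node survive, and the parent document's keys are merged in on top.

```python
# Metadata the node inherits from its parent document
parent_metadata = {"source": "doc.json", "author": "alice"}

# Metadata a custom splitter assigned to this specific chunk
node_metadata = {"section": "intro"}

# The merge get_nodes_from_documents() performs; on a key collision,
# the parent's value would win, but non-colliding splitter keys survive
node_metadata.update(parent_metadata)

print(node_metadata)
```

So per-node metadata set during splitting coexists with the inherited document metadata in the final node.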
Thanks Logan, I am on 0.8.47. I'll take a look tomorrow.
get_nodes_from_documents -- can we parse documents into nodes with this using an open-source LLM? If yes, is there any code?
Using an LLM to create nodes? That's not quite a thing yet (and would be really expensive tbh).

No code for it, but you could extend the base class and make a PR or example if you wanted πŸ˜ƒ
@Logan M I've upgraded to 0.9.3. I don't want to override the whole get_nodes_from_documents function. I would love to learn more detail about:
Furthermore, since it uses update(), your text splitter can also assign metadata to nodes as it processes them, and it will propagate to the output just fine
The only function my splitter implemented is def split_text_metadata_aware(self, text: str, metadata_str: str) -> List[str]:
Since that one actually returns nodes
You could modify it there or in the base NodeParser class, up to you
@Logan M Thank you for the help. Overriding _parse_nodes(...) works for me.

Here is the code:
Python
from typing import Any, List, Sequence, Tuple

# llama-index imports (exact paths vary by version):
# MetadataAwareTextSplitter, BaseNode, MetadataMode,
# get_tqdm_iterable, build_nodes_from_splits


class DirectusSplitter(MetadataAwareTextSplitter):
    @classmethod
    def class_name(cls) -> str:
        return "DirectusSplitter"

    # Stub to satisfy the abstract base class
    def split_text(self, text: str) -> List[str]:
        return []

    def split_text_metadata_aware(
        self, text: str, metadata_str: str
    ) -> Tuple[List[str], List[dict]]:
        chunks: List[str] = []
        additional_metadatas: List[dict] = []
        # ... split text into chunks and populate additional_metadatas
        return chunks, additional_metadatas

    def _parse_nodes(
        self, nodes: Sequence[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        all_nodes: List[BaseNode] = []
        nodes_with_progress = get_tqdm_iterable(nodes, show_progress, "Parsing nodes")

        for node in nodes_with_progress:
            metadata_str = self._get_metadata_str(node)
            chunks, additional_metadatas = self.split_text_metadata_aware(
                node.get_content(metadata_mode=MetadataMode.NONE),
                metadata_str=metadata_str,
            )
            # `split_nodes` avoids shadowing the `nodes` argument
            split_nodes = build_nodes_from_splits(chunks, node)
            for split_node, extra_metadata in zip(split_nodes, additional_metadatas):
                split_node.metadata.update(extra_metadata)
            all_nodes.extend(split_nodes)

        return all_nodes
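The pattern above -- return chunks paired with per-chunk metadata, then merge each chunk's metadata onto the node built from it -- can be sketched self-contained with plain-Python stand-ins. The `Node`, `split_with_metadata`, and `parse_nodes` names below are hypothetical, not llama-index APIs:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass
class Node:
    """Plain stand-in for a llama-index node: text plus a metadata dict."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)

def split_with_metadata(text: str) -> Tuple[List[str], List[Dict[str, Any]]]:
    # Toy splitter: one chunk per non-empty line, tagged with its index
    chunks = [line for line in text.splitlines() if line.strip()]
    metas: List[Dict[str, Any]] = [{"line": i} for i in range(len(chunks))]
    return chunks, metas

def parse_nodes(parent: Node) -> List[Node]:
    chunks, metas = split_with_metadata(parent.text)
    out: List[Node] = []
    for chunk, extra in zip(chunks, metas):
        # Start from the inherited parent metadata...
        child = Node(text=chunk, metadata=dict(parent.metadata))
        # ...then merge in the per-chunk metadata, as in _parse_nodes above
        child.metadata.update(extra)
        out.append(child)
    return out

parent = Node(text="alpha\nbeta", metadata={"source": "directus"})
for child in parse_nodes(parent):
    print(child.text, child.metadata)
```

Each child ends up with both the inherited `source` key and its own `line` key, which is exactly the per-node metadata behavior the thread is after.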
@ianuvrat I actually like your idea. Text chunking is such a painful process for me. Teaching a local mini LLM to do it makes sense to me :). It is expensive, but since it's only for the indexing phase it should be fine.