Updated 5 months ago

Hello Llamaindex,

I've developed a custom MetadataAwareTextSplitter, for precise control over the content segmentation into chunks. This splitter processes JSON data, extracting various elements like title, description, component status, search tags, and different section types. I've successfully saved the generated nodes in a Postgres-based Vector DB.

However, I'm facing a challenge with metadata association for each node. Currently, it seems that the nodes only inherit metadata from their parent document instance. Is there a way to assign unique metadata to each node, reflecting their specific content and characteristics, rather than just inheriting from the parent document?
13 comments
What version of llama-index are you on?

In version v0.9, text splitters actually are subclasses of node parsers now. This has a few implications that I think will be helpful for you

If you look at the base NodeParser class, get_nodes_from_documents() is actually calling node.metadata.update(parent_metadata) -- you can enable or disable this, or even change the default in your subclass.

Furthermore, since it uses update(), your text splitter can also assign metadata to nodes as it processes them, and it will propagate to the output just fine
Your exact use-case is part of why we made this change in the first place
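For illustration, here is a minimal plain-Python sketch (no llama-index types involved) of the dict.update() merge described above: keys the splitter assigns to a node survive, and the parent document's keys are merged in on top.

```python
# Metadata the node inherits from its parent document
parent_metadata = {"source": "doc.json", "author": "alice"}

# Metadata a custom splitter assigned to this specific chunk
node_metadata = {"section": "intro"}

# The merge get_nodes_from_documents() performs; on a key collision,
# the parent's value would win, but non-colliding splitter keys survive
node_metadata.update(parent_metadata)

print(node_metadata)
```

So per-node metadata set during splitting coexists with the inherited document metadata in the final node.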
Thanks Logan, I am on 0.8.47. I'll take a look tomorrow.
get_nodes_from_documents -- can we parse documents into nodes with this using an open-source LLM? If yes, is there any code?
Using an LLM to create nodes? That's not quite a thing yet (and would be really expensive tbh).

No code for it, but you could extend the base class and make a PR or example if you wanted πŸ˜ƒ
@Logan M I've upgraded to 0.9.3. I don't want to override the whole get_nodes_from_documents function. I would love to learn more detail about:
Furthermore, since it uses update(), your text splitter can also assign metadata to nodes as it processes them, and it will propagate to the output just fine
The only function my splitter implemented is def split_text_metadata_aware(self, text: str, metadata_str: str) -> List[str]:
Since that one actually returns nodes
You could modify it there or in the base NodeParser class, up to you
@Logan M Thank you for the help. Overriding _parse_nodes(...) works for me.

Here is the code:
Python
from typing import Any, List, Sequence, Tuple

# llama-index imports (exact paths vary by version):
# MetadataAwareTextSplitter, BaseNode, MetadataMode,
# get_tqdm_iterable, build_nodes_from_splits


class DirectusSplitter(MetadataAwareTextSplitter):
    @classmethod
    def class_name(cls) -> str:
        return "DirectusSplitter"

    # Stub to satisfy the abstract base class
    def split_text(self, text: str) -> List[str]:
        return []

    def split_text_metadata_aware(
        self, text: str, metadata_str: str
    ) -> Tuple[List[str], List[dict]]:
        chunks: List[str] = []
        additional_metadatas: List[dict] = []
        # ... split text into chunks and populate additional_metadatas
        return chunks, additional_metadatas

    def _parse_nodes(
        self, nodes: Sequence[BaseNode], show_progress: bool = False, **kwargs: Any
    ) -> List[BaseNode]:
        all_nodes: List[BaseNode] = []
        nodes_with_progress = get_tqdm_iterable(nodes, show_progress, "Parsing nodes")

        for node in nodes_with_progress:
            metadata_str = self._get_metadata_str(node)
            chunks, additional_metadatas = self.split_text_metadata_aware(
                node.get_content(metadata_mode=MetadataMode.NONE),
                metadata_str=metadata_str,
            )
            # `split_nodes` avoids shadowing the `nodes` argument
            split_nodes = build_nodes_from_splits(chunks, node)
            for split_node, extra_metadata in zip(split_nodes, additional_metadatas):
                split_node.metadata.update(extra_metadata)
            all_nodes.extend(split_nodes)

        return all_nodes
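The pattern above -- return chunks paired with per-chunk metadata, then merge each chunk's metadata onto the node built from it -- can be sketched self-contained with plain-Python stand-ins. The `Node`, `split_with_metadata`, and `parse_nodes` names below are hypothetical, not llama-index APIs:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple

@dataclass
class Node:
    """Plain stand-in for a llama-index node: text plus a metadata dict."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)

def split_with_metadata(text: str) -> Tuple[List[str], List[Dict[str, Any]]]:
    # Toy splitter: one chunk per non-empty line, tagged with its index
    chunks = [line for line in text.splitlines() if line.strip()]
    metas: List[Dict[str, Any]] = [{"line": i} for i in range(len(chunks))]
    return chunks, metas

def parse_nodes(parent: Node) -> List[Node]:
    chunks, metas = split_with_metadata(parent.text)
    out: List[Node] = []
    for chunk, extra in zip(chunks, metas):
        # Start from the inherited parent metadata...
        child = Node(text=chunk, metadata=dict(parent.metadata))
        # ...then merge in the per-chunk metadata, as in _parse_nodes above
        child.metadata.update(extra)
        out.append(child)
    return out

parent = Node(text="alpha\nbeta", metadata={"source": "directus"})
for child in parse_nodes(parent):
    print(child.text, child.metadata)
```

Each child ends up with both the inherited `source` key and its own `line` key, which is exactly the per-node metadata behavior the thread is after.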
@ianuvrat I actually like your idea. Text chunking is such a painful process for me. Teaching a local mini LLM to do it makes sense to me :). It is expensive, but since it's only for the indexing phase it should be fine.