So I guess the real question is: why does the text_splitter behave this way, and how can I index metadata in my nodes so that it is searchable but not part of this 'segmentation' process? The working approach looks like this:
    for idx, raw_document in enumerate(documents):
        turns = []
        raw_document = raw_document['raw_doc']
        ...

        # keep only the 'core' metadata keys (skip internal '__' keys)
        metadata_core = {k: v for k, v in raw_document.items() if '__' not in k}
        excluded_keys = list(metadata_core.keys())
        document = Document(
            text=conversation,  # built from `turns` in the elided code above
            metadata_seperator="::",
            metadata_template="{key}=>{value}",
            text_template="Metadata: {metadata_str}\n-----\nContent: {content}",
        )
        formatted_documents.append(document)

        # process document by document so the correct metadata
        # remains associated with the nodes
        raw_nodes = node_parser.get_nodes_from_documents([document])

        # now add the custom metadata, after splitting, so it is
        # searchable via filters but never part of the split text
        for node in raw_nodes:
            node.metadata.update(metadata_core)
            node.excluded_llm_metadata_keys = excluded_keys
            node.excluded_embed_metadata_keys = excluded_keys
            formatted_nodes.append(node)
5 comments
Where is this code coming from? ngl I'm pretty confused on where the issue is hahaha
@Logan M I solved this in the end, with the above snippet. Previously I was trying to add the metadata_core contents to the Documents before running get_nodes_from_documents, but this does not work correctly (IMO): either it first stringifies all the metadata and then applies the text splitter to everything, or it ignores the metadata completely, in which case you cannot use it in a query filter. To work around this I stopped adding the metadata to the Documents and instead add it only after the node splitting has been performed. Then I index the modified nodes.
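The workaround boils down to a library-agnostic pattern: split first, attach metadata second, so the splitter never sees the stringified metadata. A minimal sketch (the `Node` class and `split_then_tag` helper here are illustrative stand-ins, not LlamaIndex API):

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Minimal stand-in for a text node; not the LlamaIndex class.
    text: str
    metadata: dict = field(default_factory=dict)
    excluded_llm_metadata_keys: list = field(default_factory=list)
    excluded_embed_metadata_keys: list = field(default_factory=list)

def split_then_tag(text: str, metadata: dict, chunk_size: int = 20) -> list:
    """Split raw text into chunks, then attach metadata to each node.

    Because splitting happens before the metadata is attached, chunk
    boundaries depend only on the conversation text, never on the
    stringified metadata.
    """
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    excluded = list(metadata.keys())
    return [
        Node(
            text=chunk,
            metadata=dict(metadata),                # searchable via filters
            excluded_llm_metadata_keys=excluded,    # hidden from the LLM prompt
            excluded_embed_metadata_keys=excluded,  # hidden from the embedding
        )
        for chunk in chunks
    ]
```

The key design point is that the metadata rides along on every node (so query filters can see it) while the excluded-keys lists keep it out of both the embedding text and the LLM prompt.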
@Logan M Probably I shouldn't hijack this thread, but now I've moved on to one last issue I can't quite work out from the documentation yet:

nodes = retriever.retrieve(args.query)
# now filter the nodes somehow, e.g. make sure we use only
# the 'best' result from each unique document

# Create a query engine that only searches certain footnotes.
filtered_query_engine = indexes[args.index].as_query_engine(
    filters=meta_filter
)
res = filtered_query_engine.query(args.query)
print(res.response)

I have a set of nodes that I retrieve with the retriever, but as discussed previously, because a single conversation often spans multiple nodes, it is possible to get multiple 'hits' for the same conversation rather than a list of unique, best-matching conversations. To hack around this limitation I filter the raw retrieval result down to just the desired nodes. However, I don't see a way to hand this filtered set of nodes to the query engine manually (so that it doesn't rerun the kNN search).
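The filtering half of this is plain Python over the retrieved results. A sketch, assuming each retrieved node carries a conversation id in its metadata (`conversation_id` is an assumed key, and `(metadata, score)` pairs stand in for retrieved node objects):

```python
def best_per_conversation(results, id_key="conversation_id"):
    """Keep only the highest-scoring hit for each unique conversation.

    `results` is a list of (metadata_dict, score) pairs standing in for
    retrieved nodes-with-scores; `id_key` is an assumed metadata key.
    """
    best = {}
    for meta, score in results:
        cid = meta[id_key]
        # Keep this hit only if it beats the best score seen so far
        # for the same conversation.
        if cid not in best or score > best[cid][1]:
            best[cid] = (meta, score)
    # One entry per conversation, in descending score order.
    return sorted(best.values(), key=lambda pair: pair[1], reverse=True)
```

For the second half (querying over an explicit node list without re-retrieving), one avenue worth checking in the LlamaIndex docs for your version is synthesizing a response directly from nodes via a response synthesizer rather than a query engine, which skips the retrieval step entirely.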