
Updated 6 months ago

Raptor and BM25 metadata filter

At a glance

The post asks two community members, Logan M and WhiteFang_Jr, how to define metadata filters for Raptor and BM25 retrieval. Community members explain that Raptor already uses metadata filters internally, so adding others means modifying the pack code, and that a post-processing step is probably the better approach: filtering on metadata is tricky with Raptor because it retrieves based on cluster summaries. They also discuss how to keep the same number of top k results, noting that Raptor doesn't really work by top k. It retrieves the top k clusters, and each cluster can contain a varying number of chunks.

Hey @Logan M @WhiteFang_Jr
8 comments
Is there a way to define a metadata filter for Raptor and BM25 retrieval?

Plain Text
self.raptor_pack = RaptorPack(
    documents=[],
    embed_model=Settings.embed_model,
    llm=self.llm,
    vector_store=self.vector_store,
    similarity_top_k=self.similarity_top_k,
    mode="collapsed",
    summary_module=self.summary_module,
)

self.bm25_retriever = BM25Retriever.from_defaults(
    docstore=self.docstore, similarity_top_k=self.similarity_top_k
)
self.raptor_pack.retriever.filters = MetadataFilters(
    filters=[ExactMatchFilter(key="namespace", value=self.namespace)]
)

self.fusion_retriever = QueryFusionRetriever(
    [self.raptor_pack.retriever, self.bm25_retriever],
    similarity_top_k=self.similarity_top_k,
    num_queries=1,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=verbose,
)
It's already using metadata filters to filter out the raptor clusters and hierarchies.

You'll have to modify the pack code to append other filters.
Can I pass custom filters somehow to filter on namespaces?
Plain Text
MetadataFilters(
    filters=[ExactMatchFilter(key="namespace", value=self.namespace)]
)

I had them defined as:

Plain Text
document.metadata = {
    "filename": filename,
    "page_number": idx,
    "creation_date": current_date,
    "last_accessed_date": current_date,
    "last_modified_date": current_date,
    "namespace": self.namespace,
}
@Logan M Do you think adding metadata filtering as a postprocessor would be an easier way to do it?
Filtering is pretty tricky with raptor, because we are retrieving based on summaries of clusters. One cluster could contain many namespaces in your example.

Probably a post-processing step makes the most sense, since you can't really filter the clusters based on metadata, at least not easily.
Is there something I can use out of the box for post-processing based on filters, or do I do it manually right after retrieving the nodes? And how do I ensure the same number of top k results?
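For readers following along, the post-processing idea could be sketched roughly as below. This is a minimal illustration, not code from the pack: `FakeNode`, `FakeNodeWithScore`, and `filter_by_metadata` are hypothetical names, and the two stand-in classes only mimic the `.node.metadata` and `.score` attributes that retrieved nodes expose, so the snippet runs without LlamaIndex installed.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeNode:
    """Hypothetical stand-in for a node; only text and metadata are modelled."""
    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class FakeNodeWithScore:
    """Hypothetical stand-in for a retrieved node: a node plus its score."""
    node: FakeNode
    score: float


def filter_by_metadata(nodes: List[Any], key: str, value: Any) -> List[Any]:
    """Keep only nodes whose metadata[key] equals value, preserving order."""
    return [n for n in nodes if n.node.metadata.get(key) == value]


# Pretend these came back from the fusion retriever.
retrieved = [
    FakeNodeWithScore(FakeNode("a", {"namespace": "team-a"}), 0.9),
    FakeNodeWithScore(FakeNode("b", {"namespace": "team-b"}), 0.8),
    FakeNodeWithScore(FakeNode("c", {"namespace": "team-a"}), 0.7),
]

kept = filter_by_metadata(retrieved, key="namespace", value="team-a")
print([n.node.text for n in kept])
```

In a real pipeline the same function would presumably be called on the list returned by `self.fusion_retriever.retrieve(...)`, before the recency and rerank post-processors.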

Plain Text
query_bundle = QueryBundle(query_str=query)
retrived_nodes = self.fusion_retriever.retrieve(query_bundle)
recency_nodes = self.recency_postprocessor.postprocess_nodes(
    retrived_nodes, query_bundle=query_bundle
)
rerank_nodes = self.postprocessor.postprocess_nodes(
    nodes=recency_nodes, query_bundle=query_bundle
)
Raptor doesn't really work by top k. It retrieves the top k clusters, but a cluster can contain any number of chunks (up to some maximum).
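Since the number of chunks coming back is not fixed, one common workaround (an assumption here, not something the thread confirms) is to over-fetch, filter on metadata, and then truncate to the desired count. The sketch below uses hypothetical names throughout: `top_k_after_filter`, the `overfetch` factor, and `fake_retrieve`, a stand-in callable that takes a query and a candidate count. The retrievers in the thread fix `similarity_top_k` at construction instead, so in practice they would be built with a larger value.

```python
from types import SimpleNamespace
from typing import Any, Callable, List


def top_k_after_filter(
    retrieve: Callable[[str, int], List[Any]],
    query: str,
    key: str,
    value: Any,
    top_k: int,
    overfetch: int = 3,
) -> List[Any]:
    """Fetch top_k * overfetch candidates, keep metadata matches, return at most top_k."""
    candidates = retrieve(query, top_k * overfetch)
    matching = [n for n in candidates if n.node.metadata.get(key) == value]
    # Treat a missing score as 0.0 so the sort never compares None.
    matching.sort(key=lambda n: n.score or 0.0, reverse=True)
    return matching[:top_k]


def make_node(text: str, namespace: str, score: float) -> SimpleNamespace:
    """Build a hypothetical retrieved node with .node.metadata and .score."""
    node = SimpleNamespace(text=text, metadata={"namespace": namespace})
    return SimpleNamespace(node=node, score=score)


# Eight fake chunks alternating between two namespaces, with descending scores.
CORPUS = [
    make_node(f"chunk-{i}", "team-a" if i % 2 == 0 else "team-b", 1.0 - i * 0.1)
    for i in range(8)
]


def fake_retrieve(query: str, k: int) -> List[Any]:
    """Stand-in retriever: returns the first k corpus entries regardless of query."""
    return CORPUS[:k]


results = top_k_after_filter(fake_retrieve, "q", "namespace", "team-a", top_k=2)
print([n.node.text for n in results])
```

Note that this only caps the result at `top_k`. If too few candidates match the filter, fewer than `top_k` nodes come back, and a larger `overfetch` factor would be needed.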
Thanks Logan, this clears things up