Raptor and BM25 metadata filter

At a glance

The post opens by tagging two community members, Logan M and WhiteFang_Jr. The comments discuss how to define metadata filters for Raptor and BM25 retrieval. Because Raptor retrieves based on cluster summaries, filtering on metadata is tricky, so community members suggest either modifying the pack code to append other filters or filtering in a post-processing step. They also discuss how to keep the number of results constant, noting that Raptor doesn't really work by top k: it retrieves the top k clusters, each of which can contain a varying number of chunks.

bbeaverTango

Hey @Logan M @WhiteFang_Jr

8 comments

bbeaverTango

Is there a way to define a metadata filter for Raptor and BM25 retrieval?


self.raptor_pack = RaptorPack(
    documents=[],
    embed_model=Settings.embed_model,
    llm=self.llm,
    vector_store=self.vector_store,
    similarity_top_k=self.similarity_top_k,
    mode="collapsed",
    summary_module=self.summary_module,
)

self.bm25_retriever = BM25Retriever.from_defaults(
    docstore=self.docstore, similarity_top_k=self.similarity_top_k
)
self.raptor_pack.retriever.filters = MetadataFilters(
    filters=[ExactMatchFilter(key="namespace", value=self.namespace)]
)

self.fusion_retriever = QueryFusionRetriever(
    [self.raptor_pack.retriever, self.bm25_retriever],
    similarity_top_k=self.similarity_top_k,
    num_queries=1,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=verbose,
)

Logan M

It's already using metadata filters to filter out the raptor clusters and hierarchies.

You'll have to modify the pack code to append other filters

bbeaverTango

Can I somehow pass custom filters to filter on namespaces?


MetadataFilters(
    filters=[ExactMatchFilter(key="namespace", value=self.namespace)]
)

I had them defined as:


document.metadata = {
    "filename": filename,
    "page_number": idx,
    "creation_date": current_date,
    "last_accessed_date": current_date,
    "last_modified_date": current_date,
    "namespace": self.namespace,
}

bbeaverTango

@Logan M Do you think adding metadata filtering as a postprocessor would be an easier way to do it?

Logan M

Filtering is pretty tricky with raptor, because we are retrieving based on summaries of clusters. One cluster could contain many namespaces in your example.

Probably a post-processing step makes the most sense? Since you can't really filter the clusters based on metadata, at least not easily.
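
A post-processing filter of the kind suggested here could look roughly like the sketch below. It is not LlamaIndex code: `RetrievedNode` is a hypothetical stand-in for a retrieved node carrying a `metadata` dict and a `score`, and `filter_by_metadata` is an illustrative helper name. In a real pipeline the same check would run on the nodes returned by the fusion retriever, before any recency or rerank step.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class RetrievedNode:
    """Hypothetical stand-in for a retrieved node (not a LlamaIndex class)."""

    text: str
    score: float
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(
    nodes: list[RetrievedNode], key: str, value: Any
) -> list[RetrievedNode]:
    """Keep only nodes whose metadata[key] equals value; order is preserved."""
    return [node for node in nodes if node.metadata.get(key) == value]


# Example data is made up for illustration.
retrieved = [
    RetrievedNode("chunk A", 0.91, {"namespace": "team-a"}),
    RetrievedNode("chunk B", 0.88, {"namespace": "team-b"}),
    RetrievedNode("untagged chunk", 0.80, {}),  # no "namespace" key, so it is dropped
    RetrievedNode("chunk C", 0.75, {"namespace": "team-a"}),
]

kept = filter_by_metadata(retrieved, key="namespace", value="team-a")
print([node.text for node in kept])  # ['chunk A', 'chunk C']
```

Nodes that lack the key entirely are dropped in this sketch; whether that is the right behaviour depends on how the metadata was attached at ingestion time.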

bbeaverTango

Is there something I can use out of the box for post-processing based on filters, or do I do it manually right after retrieving the nodes? And how do I ensure the same number of top k results?


query_bundle = QueryBundle(query_str=query)
retrived_nodes = self.fusion_retriever.retrieve(query_bundle)
recency_nodes = self.recency_postprocessor.postprocess_nodes(
    retrived_nodes, query_bundle=query_bundle
)
rerank_nodes = self.postprocessor.postprocess_nodes(
    nodes=recency_nodes, query_bundle=query_bundle
)

Logan M

Raptor doesn't really work by top k. It retrieves the top k clusters, but a cluster can contain any number of chunks (up to some maximum).
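
Because the number of chunks coming back varies, one possible way to get a predictable result count is to retrieve more than needed, filter on metadata, and then truncate by score. The sketch below assumes that approach and is not taken from the thread; `ScoredNode` and `filter_and_truncate` are hypothetical names rather than LlamaIndex APIs, and the result can still hold fewer than `top_k` nodes if too few survive the filter.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ScoredNode:
    """Hypothetical stand-in for a retrieved node (not a LlamaIndex class)."""

    text: str
    score: float
    metadata: dict[str, Any] = field(default_factory=dict)


def filter_and_truncate(
    nodes: list[ScoredNode], key: str, value: Any, top_k: int
) -> list[ScoredNode]:
    """Filter by metadata, then keep at most top_k nodes with the highest scores."""
    matching = [node for node in nodes if node.metadata.get(key) == value]
    matching.sort(key=lambda node: node.score, reverse=True)
    return matching[:top_k]


# Over-fetched candidates (made-up data): more nodes than the final top_k.
candidates = [
    ScoredNode("chunk D", 0.62, {"namespace": "team-a"}),
    ScoredNode("chunk A", 0.91, {"namespace": "team-a"}),
    ScoredNode("chunk B", 0.88, {"namespace": "team-b"}),
    ScoredNode("chunk C", 0.75, {"namespace": "team-a"}),
]

top_nodes = filter_and_truncate(candidates, key="namespace", value="team-a", top_k=2)
print([node.text for node in top_nodes])  # ['chunk A', 'chunk C']
```

The cap is an upper bound, not a guarantee: if only one node matches the namespace, only one is returned.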

bbeaverTango

Thanks Logan, this clears things up
