Updated 3 months ago

Raptor and BM25 metadata filter

Hey @Logan M @WhiteFang_Jr
Is there a way to define a metadata filter for Raptor and BM25 retrieval?

Plain Text
self.raptor_pack = RaptorPack(
    documents=[],
    embed_model=Settings.embed_model,
    llm=self.llm,
    vector_store=self.vector_store,
    similarity_top_k=self.similarity_top_k,
    mode="collapsed",
    summary_module=self.summary_module,
)

self.bm25_retriever = BM25Retriever.from_defaults(
    docstore=self.docstore, similarity_top_k=self.similarity_top_k
)
self.raptor_pack.retriever.filters = MetadataFilters(
    filters=[ExactMatchFilter(key="namespace", value=self.namespace)]
)

self.fusion_retriever = QueryFusionRetriever(
    [self.raptor_pack.retriever, self.bm25_retriever],
    similarity_top_k=self.similarity_top_k,
    num_queries=1,  # set this to 1 to disable query generation
    mode="reciprocal_rerank",
    use_async=True,
    verbose=verbose,
)
It's already using metadata filters to filter out the raptor clusters and hierarchies.

You'll have to modify the pack code to append other filters.
Can I somehow pass custom filters to filter on namespaces?
Plain Text
MetadataFilters(
    filters=[ExactMatchFilter(key="namespace", value=self.namespace)]
)


I had them defined as:

Plain Text
document.metadata = {
    "filename": filename,
    "page_number": idx,
    "creation_date": current_date,
    "last_accessed_date": current_date,
    "last_modified_date": current_date,
    "namespace": self.namespace,
}
@Logan M Do you think adding metadata filtering as a postprocessor would be an easier way to do it?
Filtering is pretty tricky with raptor, because we are retrieving based on summaries of clusters. One cluster could contain many namespaces in your example.

Probably a post-processing step makes the most sense, since you can't really filter the clusters based on metadata, at least not easily.
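
For illustration, here is a minimal sketch of what such a post-processing filter could look like. It deliberately uses a hypothetical stand-in class (`ScoredNode`) instead of the real LlamaIndex node types so that it runs on its own; the helper name `filter_by_metadata` is also an assumption, not something from the library or the pack.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ScoredNode:
    """Hypothetical stand-in for a retrieved node plus its score."""
    text: str
    score: float
    metadata: Dict[str, Any] = field(default_factory=dict)


def filter_by_metadata(
    nodes: List[ScoredNode], required: Dict[str, Any]
) -> List[ScoredNode]:
    """Keep only nodes whose metadata equals every key/value in `required`."""
    return [
        node
        for node in nodes
        if all(node.metadata.get(key) == value for key, value in required.items())
    ]


# Example: nodes from two namespaces come back from retrieval together.
retrieved = [
    ScoredNode("chunk 1", 0.91, {"namespace": "team-a", "page_number": 1}),
    ScoredNode("chunk 2", 0.85, {"namespace": "team-b", "page_number": 4}),
    ScoredNode("chunk 3", 0.70, {"namespace": "team-a", "page_number": 2}),
    ScoredNode("cluster summary", 0.65, {}),  # no namespace recorded
]

kept = filter_by_metadata(retrieved, {"namespace": "team-a"})
```

Note that a node without a `namespace` key (for example a cluster summary that spans several namespaces) is dropped by this exact-match check. Whether that is the desired behaviour depends on how the summaries are stored.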
Is there something I can use here out of the box for post-processing based on filters, or do I do it manually right after retrieving the nodes? And how do I ensure the same number of top-k results?

Plain Text
query_bundle = QueryBundle(query_str=query)
retrieved_nodes = self.fusion_retriever.retrieve(query_bundle)
recency_nodes = self.recency_postprocessor.postprocess_nodes(
    retrieved_nodes, query_bundle=query_bundle
)
rerank_nodes = self.postprocessor.postprocess_nodes(
    nodes=recency_nodes, query_bundle=query_bundle
)
Raptor doesn't really work by top k. It retrieves the top k clusters, but each cluster can contain any number of chunks (up to some maximum).
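
If a fixed upper bound on the result count is still wanted after filtering, one possible approach is to over-fetch, filter, and truncate. The sketch below assumes a hypothetical `retrieve(n)` callable that returns up to `n` hits as `(text, score, metadata)` tuples. It is not the fusion retriever's actual interface, and the `max_fetch` limit is an arbitrary choice. It can only cap the count at `k`; it cannot guarantee that `k` matching results exist.

```python
from typing import Any, Callable, Dict, List, Tuple

# (text, score, metadata); a hypothetical shape, not a LlamaIndex type.
Hit = Tuple[str, float, Dict[str, Any]]


def retrieve_filtered_top_k(
    retrieve: Callable[[int], List[Hit]],
    namespace: str,
    k: int,
    max_fetch: int = 64,
) -> List[Hit]:
    """Fetch progressively more hits until k of them match the namespace."""
    fetch = k
    while True:
        hits = retrieve(fetch)
        kept = [hit for hit in hits if hit[2].get("namespace") == namespace]
        exhausted = len(hits) < fetch  # the retriever has nothing more to give
        if len(kept) >= k or fetch >= max_fetch or exhausted:
            # Highest score first, then cap at k.
            return sorted(kept, key=lambda hit: hit[1], reverse=True)[:k]
        fetch = min(fetch * 2, max_fetch)


# Toy corpus, already ordered by score, mixing two namespaces.
corpus: List[Hit] = [
    ("d1", 0.9, {"namespace": "b"}),
    ("d2", 0.8, {"namespace": "a"}),
    ("d3", 0.7, {"namespace": "b"}),
    ("d4", 0.6, {"namespace": "b"}),
    ("d5", 0.5, {"namespace": "a"}),
    ("d6", 0.4, {"namespace": "a"}),
]


def fake_retrieve(n: int) -> List[Hit]:
    """Stand-in retriever: returns the n best-scoring hits."""
    return corpus[:n]


top_two = retrieve_filtered_top_k(fake_retrieve, namespace="a", k=2)
```

Each retry re-runs retrieval with a larger fetch size, so this trades extra retrieval cost for a more predictable result count.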
Thanks Logan, this clears things up