is there a way to do hybrid retriever

is there a way to do a hybrid retriever / bm25 retriever without storing and loading from the filesystem (docstore / objectstore)?
7 comments
If you manually wrote the bm25 algorithm to generate the sparse vectors and threw that into a vector store, then yes πŸ™‚
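A rough sketch of that manual approach (not a LlamaIndex API, just the BM25 math with the usual k1/b defaults; the actual vector-store upsert is left out and the corpus is made up):

```python
import math
from collections import Counter

# Toy corpus; in practice these would be your node texts.
docs = [
    "hybrid retrieval combines dense and sparse vectors",
    "bm25 is a classic sparse ranking function",
    "dense embeddings capture semantic similarity",
]

k1, b = 1.5, 0.75  # standard BM25 parameters

tokenized = [d.lower().split() for d in docs]
N = len(tokenized)
avgdl = sum(len(t) for t in tokenized) / N

# Document frequency per term -> IDF. This is why adding or removing a
# document forces a full recompute: every IDF depends on the whole corpus.
df = Counter(term for toks in tokenized for term in set(toks))
idf = {t: math.log(1 + (N - n + 0.5) / (n + 0.5)) for t, n in df.items()}
vocab = {t: i for i, t in enumerate(sorted(idf))}

def bm25_sparse_vector(tokens: list[str]) -> dict[int, float]:
    """Return {term_id: bm25_weight} for one document."""
    tf = Counter(tokens)
    dl = len(tokens)
    return {
        vocab[t]: idf[t] * (f * (k1 + 1)) / (f + k1 * (1 - b + b * dl / avgdl))
        for t, f in tf.items()
    }

sparse_vectors = [bm25_sparse_vector(toks) for toks in tokenized]
# Each entry can now be upserted into any vector store that accepts
# sparse (index, value) pairs alongside a dense embedding.
print(sparse_vectors[1])
```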

The problem with bm25 is that if any text is added or removed, the entire index needs to be recalculated

I've been meaning to switch to the newer (and faster) bm25s library so that people can at least save the retriever to disk
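For reference, a minimal sketch of what saving and reloading looks like with the bm25s library (based on its documented API as I understand it; the corpus and index path here are made up):

```python
import bm25s

corpus = [
    "hybrid retrieval combines dense and sparse signals",
    "bm25s is a fast pure-python bm25 implementation",
]

# Build the index in memory.
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(corpus))

# Persist to disk and reload later, instead of re-indexing every run.
retriever.save("bm25s_index", corpus=corpus)
reloaded = bm25s.BM25.load("bm25s_index", load_corpus=True)

results, scores = reloaded.retrieve(bm25s.tokenize("fast bm25"), k=1)
print(results[0, 0], scores[0, 0])
```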
Not sure. It's very small, which is nice, so it should be much faster than SPLADE. But they really didn't leave much info for benchmarking
In any case though, seems like some minor tweaks are needed to support it properly in llama-index (need to set that IDF param in the config for example)
Yeah conceptually seems useful, but TBD
Recalculating the whole index doesn't seem too nice.
I saw Qdrant has BM25, but it's all saved to the cloud, not to disk as a docstore
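For what it's worth, Qdrant can also run locally (in memory, or persisted to a local path) rather than only against the cloud, and the IDF behaviour mentioned above is a collection-level setting. A hedged sketch with qdrant-client, assuming a recent version that supports the IDF modifier; the collection name and sizes are made up:

```python
from qdrant_client import QdrantClient, models

# Local mode: ":memory:" keeps everything in RAM; path="./qdrant_data"
# persists to disk without any cloud instance.
client = QdrantClient(":memory:")

client.create_collection(
    collection_name="hybrid_demo",
    vectors_config={
        "dense": models.VectorParams(size=384, distance=models.Distance.COSINE),
    },
    sparse_vectors_config={
        # The IDF modifier tells Qdrant to apply IDF weighting server-side,
        # which is the config tweak referred to above.
        "bm25": models.SparseVectorParams(modifier=models.Modifier.IDF),
    },
)
```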