Hi, is there any implementation mechanism in LlamaIndex to create a BM25 retriever from an existing ChromaVectorStore?

At a glance

The community member is asking whether LlamaIndex provides a way to build a BM25 retriever from an existing ChromaVectorStore object for advanced RAG retrieval, or how to obtain a docstore object from an existing ChromaVectorStore or retrieve all of its nodes. Another community member suggests that the list of all existing nodes can be retrieved from the ChromaVectorStore object using chroma_vector_store.client.get()["documents"].

The community member then mentions having a large ChromaDB file (163GB) and that creating a new BM25 retriever from it takes a long time. The other community members suggest that this is normal given the size of the corpus, since the retriever tokenizes nodes one by one. They also discuss whether the corpus is persisted or whether only new nodes are processed on update, which could affect performance on subsequent runs.

The community members discuss a potential solution to prepare and store the BM25 retriever data to disk, so that it doesn't take as long to create the retriever the next time. However, there is no explicitly marked answer in the comments.

Hi, is there any implementation mechanism in LlamaIndex to create a BM25 retriever from an already existing ChromaVectorStore object for advanced RAG retrieval? Or how can I get a docstore object from an existing ChromaVectorStore, or retrieve all nodes from an already existing ChromaVectorStore?
6 comments
You should be able to get the list of all existing nodes from a ChromaVectorStore object like this:
Plain Text
chroma_vector_store.client.get()["documents"]
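For reference, here is a minimal sketch of that approach, assuming a persisted Chroma collection and the BM25Retriever from the llama-index-retrievers-bm25 package; the collection name and paths are placeholders, and only the node IDs and raw text are recovered (any metadata stored alongside them is ignored):
Python
import chromadb
from llama_index.core.schema import TextNode
from llama_index.retrievers.bm25 import BM25Retriever
from llama_index.vector_stores.chroma import ChromaVectorStore

# Open the already-persisted Chroma collection (path and name are placeholders)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_collection")
chroma_vector_store = ChromaVectorStore(chroma_collection=chroma_collection)

# Pull every stored entry back out; get() returns a dict-like result with
# parallel lists under "ids" and "documents"
result = chroma_vector_store.client.get()
nodes = [
    TextNode(id_=node_id, text=text)
    for node_id, text in zip(result["ids"], result["documents"])
]

# Build a BM25 retriever over the recovered nodes for advanced RAG retrieval
bm25_retriever = BM25Retriever.from_defaults(nodes=nodes, similarity_top_k=5)

On a very large collection, both the get() call and the BM25 tokenization step will be slow, so this is best done once and reused rather than on every start-up.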
Thanks @Rohan. One more question: I have a relatively large ChromaDB file (163GB). I've tried to create a new BM25 retriever from that ChromaDB text based on your help, but it takes too long to create the new retriever. Is that normal given the size of the ChromaDB file?
I haven't worked with a corpus that big, but since it tokenizes nodes one by one, that might be why it's taking so long.
Is there any way I can prepare and store the BM25 retriever data to disk, so that next time it won't take as long as the first time?
I'm not exactly sure whether the corpus is persisted or the nodes are tokenized on every update. If only the new nodes are processed on update, then it won't take as long as the first run.
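One way to do that, assuming a recent llama-index-retrievers-bm25 release (the bm25s-backed retriever, which exposes persist() and from_persist_dir()), is to build the retriever once, write its index to disk, and reload it on later runs instead of re-tokenizing the whole corpus; the directory path below is a placeholder:
Python
from llama_index.retrievers.bm25 import BM25Retriever

# First run: build the retriever from the Chroma nodes (see the earlier snippet),
# then write its tokenized index to disk
bm25_retriever.persist("./bm25_persist")

# Later runs: load the prebuilt index instead of tokenizing the corpus again
bm25_retriever = BM25Retriever.from_persist_dir("./bm25_persist")

If the installed version does not provide persist()/from_persist_dir(), an alternative is to persist the reconstructed nodes in a docstore so the Chroma export is not repeated, although the BM25 index itself would still be rebuilt on each run.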
Thanks, I will test that with a small corpus.