The community member is asking how to build a BM25 retriever from an existing ChromaVectorStore object for advanced RAG retrieval, or alternatively how to obtain a docstore object from an existing ChromaVectorStore or retrieve all nodes from it. Another community member suggests that the list of all existing documents can be retrieved from the ChromaVectorStore object via chroma_vector_store.client.get()["documents"].
The community member then mentions having a large ChromaDB file (163GB) and that creating a new BM25 retriever from it takes a long time. The other community members suggest that this is normal due to the size of the corpus, as it tokenizes nodes one by one. They also discuss whether the corpus is persisted or if only new nodes are processed on update, which could affect the performance on subsequent runs.
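The cost described here comes from the fact that a BM25 index must tokenize every document up front to compute term statistics. A stdlib-only sketch of what such an index build involves (an illustration of the algorithm, not LlamaIndex's actual implementation):

```python
import math
from collections import Counter

def tokenize(text):
    # naive lowercase/whitespace tokenizer; real tokenizers also
    # strip punctuation and stopwords, which adds further cost
    return text.lower().split()

class BM25Index:
    """Minimal Okapi BM25 index. The constructor must tokenize the
    whole corpus, which is why building over 163GB is slow."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]  # O(total corpus size)
        self.doc_len = [len(t) for t in self.doc_tokens]
        self.avgdl = sum(self.doc_len) / len(docs)
        self.tf = [Counter(t) for t in self.doc_tokens]
        df = Counter()
        for toks in self.doc_tokens:
            df.update(set(toks))
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        s = 0.0
        for term in tokenize(query):
            if term not in self.idf:
                continue
            f = self.tf[i][term]
            denom = f + self.k1 * (1 - self.b + self.b * self.doc_len[i] / self.avgdl)
            s += self.idf[term] * f * (self.k1 + 1) / denom
        return s

    def top(self, query, k=1):
        ranked = sorted(range(len(self.tf)), key=lambda i: self.score(query, i), reverse=True)
        return ranked[:k]
```

Because all of the statistics depend only on the documents, the expensive pass can in principle be done once and reused, which is what the rest of the thread discusses.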
The community members discuss a potential solution to prepare and store the BM25 retriever data to disk, so that it doesn't take as long to create the retriever the next time. However, there is no explicitly marked answer in the comments.
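Since the thread leaves this open, here is one generic pattern (a sketch, not a LlamaIndex API): tokenize once, serialize the result, and reload it on subsequent runs so the expensive pass is skipped. The file name is hypothetical. Newer releases of the llama-index-retrievers-bm25 package reportedly ship persist helpers for the retriever itself; check the current docs before rolling your own:

```python
import pathlib
import pickle

def build_token_cache(docs):
    # the expensive pass: tokenize every document once
    return [d.lower().split() for d in docs]

def save_cache(tokens, path="bm25_tokens.pkl"):  # hypothetical file name
    pathlib.Path(path).write_bytes(pickle.dumps(tokens))

def load_cache(path="bm25_tokens.pkl"):
    # returns None on a cold start, so callers know to rebuild
    p = pathlib.Path(path)
    return pickle.loads(p.read_bytes()) if p.exists() else None
```

On update, only new nodes would need tokenizing before being appended to the cached list, which matches the incremental behavior the thread speculates about.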
Hi, is there any mechanism in LlamaIndex to create a BM25 retriever from an already existing ChromaVectorStore object for advanced RAG retrieval? Or how can I get a docstore object from an existing ChromaVectorStore, or retrieve all nodes from one?
Thanks @Rohan. One more question: I have a relatively large ChromaDB file (163GB). I've tried to create a new BM25 retriever from that ChromaDB based on your help, but it takes too long to create the new retriever. Is that normal given the size of the ChromaDB file?
I'm not exactly sure whether the corpus is persisted or the nodes are tokenized on every update. If only the new nodes are processed on update, then it won't take as long as the first run.