The community member is building a hybrid Q&A RAG pipeline that uses semantic and keyword search over a set of documents. They want to store the StorageContext in advance to improve processing time, and have questions about where to store its different components (index_store, vector_store, and docstore) and where to store the SimpleKeywordTableIndex used for keyword search.
The comments suggest using a vector database for faster storage, but also mention that other storage methods like S3 buckets can be used. The community member is concerned about the cost of using hybrid search options provided by some vector database providers and wants to use open-source and free options as much as possible.
One community member suggests storing the SimpleKeywordTableIndex in the index_store, which can be kept in memory (and then persisted to disk) or stored in a cloud storage option such as an S3 bucket.
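As a minimal sketch of that suggestion, assuming the default in-memory stores persisted to a local directory (the `./data` and `./storage` paths and the index ids are illustrative):

```python
from llama_index.core import (
    SimpleDirectoryReader,
    SimpleKeywordTableIndex,
    StorageContext,
    VectorStoreIndex,
)

# Build both indices over the same documents, sharing one StorageContext so the
# docstore, index_store, and vector_store all end up in the same place.
documents = SimpleDirectoryReader("./data").load_data()
storage_context = StorageContext.from_defaults()

vector_index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
keyword_index = SimpleKeywordTableIndex.from_documents(documents, storage_context=storage_context)

# Give each index a stable id so it can be loaded back by name later.
vector_index.set_index_id("vector_index")
keyword_index.set_index_id("keyword_index")

# Persist the whole StorageContext (docstore, index_store, vector_store) to disk.
storage_context.persist(persist_dir="./storage")
```

Note that the keyword table is serialized into the index_store and docstore rather than the vector_store, so nothing about it requires a vector database.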
I am building a hybrid Q&A RAG pipeline (using semantic and keyword search) over a set of documents. Currently, it takes too long to answer a question, so I want to store the StorageContext in advance to improve processing time. Is that a good practice? What are some things I need to keep in mind for this purpose? Some other questions I have:
1) I understand that StorageContext has four components: index_store, vector_store, graph_store, and docstore. For my use case, there is no graph_store. Where can I store the remaining three stores? Is it best practice to store all of them in a vector database?
2) I am using SimpleKeywordTableIndex for keyword search. Where can I store this index if I want to build it in advance? Can it also be stored in a vector database?
I would really appreciate it if you could point me to documentation for this use case. Thanks!
I referred to secinsights.ai, but it uses Postgres as the vector database and an AWS S3 bucket to store the StorageContext. The difference in my use case is that I am using both semantic and keyword search, versus just semantic search in secinsights. Will I need to use a vector database for semantic search and store the StorageContext separately in an S3 bucket? Is that the most efficient option?
I don't want to use the hybrid search options provided by some vector DB providers such as Pinecone and Weaviate because they come with extra cost. I want to keep costs to a minimum and use open-source and free options as much as I can. Does that make sense?
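As a rough sketch of the fully local, no-cost route discussed above (reusing the illustrative `./storage` directory and index ids from the earlier snippet), the persisted StorageContext can be reloaded without any hosted vector database or S3 bucket:

```python
from llama_index.core import StorageContext, load_index_from_storage

# Rebuild the StorageContext from the persisted directory; no hosted vector DB
# or S3 bucket is required for this to work.
storage_context = StorageContext.from_defaults(persist_dir="./storage")

# Load each index back by the id it was saved under.
vector_index = load_index_from_storage(storage_context, index_id="vector_index")
keyword_index = load_index_from_storage(storage_context, index_id="keyword_index")

# Query each side of the hybrid pipeline; how the results are combined is up to
# the application (e.g. merging retrieved nodes before synthesis).
semantic_engine = vector_index.as_query_engine()
keyword_engine = keyword_index.as_query_engine()
```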