Hello, I'm trying to develop an app with a RAG pipeline that has two parts: 1) Knowledge base builder - takes in documents from storage and extracts embeddings from them. 2) Client part - takes the user's question and returns a list of documents.
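# Import paths assume a legacy (pre-0.10) llama_index release that still ships ServiceContext.
import os

from llama_index import ServiceContext, StorageContext, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.storage.docstore import SimpleDocumentStore
from llama_index.storage.index_store import SimpleIndexStore
from llama_index.vector_stores import MilvusVectorStore

# `documents` and `config` are defined elsewhere in the app.
docstore = SimpleDocumentStore()
docstore.add_documents(documents)
vector_store = MilvusVectorStore(**config.get("vector_store"))
index_store = SimpleIndexStore()
storage_context = StorageContext.from_defaults(vector_store=vector_store,
                                               index_store=index_store,
                                               docstore=docstore)
# Use OpenAI embeddings when an API key is set, otherwise a local model.
embed_model = OpenAIEmbedding() if os.environ.get("OPENAI_API_KEY") else "local"
service_context = ServiceContext.from_defaults(embed_model=embed_model)
index = VectorStoreIndex.from_documents(documents=documents,
                                        storage_context=storage_context,
                                        service_context=service_context,
                                        show_progress=True)
index.storage_context.persist(config.get("storage_context"))
docstore.persist(config.get("documents_store"))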
Questions:
1) Do I need to parse my documents into nodes and add both docs and nodes to the docstore?
2.a) Does VectorStoreIndex.from_documents() parse nodes by default?
2.b) Can I force it to use documents?
2.c) Am I right that the only way to keep relations between nodes and documents is to parse them explicitly?
3) Do I need to pass documents to VectorStoreIndex even if storage_context is aware of the docstore?
4) Does the docstore update documents if I add documents with the same id but different contents?
5) Am I right that node_id is used to identify vectors in my MilvusVectorStore?
6) Why do I need to persist storage_context if the client part loads the index using VectorStoreIndex.from_vector_store()? What is the correct way to save the index and use it for querying?
You don't have to parse the documents on your own if you don't want to. You can simply pass the documents when creating the index, and it will create the nodes and add them to the docstore on its own.
2.a) Yes. 2.b) No. 2.c) When you pass documents at index creation time, each document is converted into nodes, and that conversion uses a setting called include_prev_next_rel, which keeps related nodes linked to one another.
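To make the idea of "relations kept during parsing" more concrete, here is a rough pure-Python sketch. It is not LlamaIndex's implementation: `ToyNode`, `split_into_nodes`, and the fixed-size character chunking are all made up for illustration. It only shows the general pattern in which every chunk remembers its source document and its previous/next neighbours.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a node; not a LlamaIndex class."""
    node_id: str
    text: str
    ref_doc_id: str                 # link back to the source document
    prev_id: Optional[str] = None   # previous chunk of the same document
    next_id: Optional[str] = None   # next chunk of the same document


def split_into_nodes(doc_id: str, text: str, chunk_size: int = 20) -> List[ToyNode]:
    """Cut text into fixed-size chunks and wire up prev/next links."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    nodes = [ToyNode(f"{doc_id}-{n}", chunk, doc_id) for n, chunk in enumerate(chunks)]
    # Link each pair of neighbouring chunks in both directions.
    for left, right in zip(nodes, nodes[1:]):
        left.next_id = right.node_id
        right.prev_id = left.node_id
    return nodes


nodes = split_into_nodes("doc-1", "a" * 50)
for node in nodes:
    print(node.node_id, node.ref_doc_id, node.prev_id, node.next_id)
```

The point is that you get these links as a side effect of splitting, so you do not have to build them by hand before indexing.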
Yes. If you have added new information to the docstore, then you also need to create embeddings for it.
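One way to picture "new information needs new embeddings" is a store that tracks a content hash per document id and tells you when something changed. This is only a sketch under my own assumptions: `ToyDocStore` and `upsert` are hypothetical names, and it is not a claim about how SimpleDocumentStore behaves internally.

```python
import hashlib
from typing import Dict


class ToyDocStore:
    """Hypothetical store that maps doc_id to text and a content hash."""

    def __init__(self) -> None:
        self._docs: Dict[str, str] = {}
        self._hashes: Dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> bool:
        """Store the document; return True if it needs (re-)embedding."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        # A missing id or a different hash both mean the embedding is stale.
        changed = self._hashes.get(doc_id) != digest
        self._docs[doc_id] = text
        self._hashes[doc_id] = digest
        return changed


store = ToyDocStore()
first = store.upsert("doc-1", "original text")    # new document
repeat = store.upsert("doc-1", "original text")   # same id, same content
edited = store.upsert("doc-1", "edited text")     # same id, new content
print(first, repeat, edited)
```

Only the calls that return True would need to go back through the embedding model.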
@Logan M and @WhiteFang_Jr, thank you for the answers. What is the key concept of VectorStoreIndex? I've read the documentation but still can't get what this object is for.
A vector store index, or any other index, serves a similar purpose. How each one serves queries can differ from index to index, but the core remains the same.
VectorStoreIndex stores the nodes of the given documents together with their embeddings. You can insert, update, and delete nodes through the index. The index also lets you create a query_engine instance, which helps you query your own data.
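If it helps, the core idea can be sketched in a few lines of plain Python. This is a toy, not the real VectorStoreIndex: `ToyVectorIndex`, its methods, and the hand-written two-dimensional "embeddings" are all assumptions for illustration. A real setup would get the vectors from an embedding model and keep them in a vector store such as Milvus.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyVectorIndex:
    """Hypothetical in-memory index mapping node_id to (text, embedding)."""

    def __init__(self) -> None:
        self._nodes: Dict[str, Tuple[str, List[float]]] = {}

    def insert(self, node_id: str, text: str, embedding: List[float]) -> None:
        # Inserting an existing node_id overwrites it, which doubles as update.
        self._nodes[node_id] = (text, embedding)

    def delete(self, node_id: str) -> None:
        self._nodes.pop(node_id, None)

    def query(self, embedding: List[float], top_k: int = 2) -> List[Tuple[str, float]]:
        """Return the top_k (node_id, score) pairs, most similar first."""
        scored = [(nid, cosine(embedding, emb)) for nid, (_, emb) in self._nodes.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


index = ToyVectorIndex()
index.insert("n1", "about cats", [1.0, 0.0])
index.insert("n2", "about cars", [0.0, 1.0])
index.insert("n3", "mostly about cats", [0.9, 0.1])

top_ids = [nid for nid, _ in index.query([1.0, 0.0], top_k=2)]
print(top_ids)

index.delete("n1")
after_delete = [nid for nid, _ in index.query([1.0, 0.0], top_k=1)]
print(after_delete)
```

A query engine then sits on top of this kind of lookup: it embeds the question, retrieves the closest nodes, and hands their text to the LLM.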