afternoon builders :salutecoffee:

i'm working to have a shared storage space amongst tools, so i'm passing the same instance of a StorageContext. i'd like to persist (to the shared StorageContext) the original document nodes, the chunked/metadata-enhanced nodes, and indexes [over the node stores] in the ingestion tool, then read them elsewhere. i'm trying to understand how the different storage objects in the context are managed

  1. ingestion tool reads files to get document nodes (manually adding these docs to the docstore after getting the document nodes, since there's no storage_context arg to SimpleDirectoryReader -- rough sketch of this step below)
  2. ingestion tool then chunks them (i'm passing a vector_store here, but should i also update the docstore? will vector_stores[vector_store.name] automatically add the chunks/nodes from the transformation? will it only do that if one of the transformation steps adds embeddings to each node?)
  3. the ingestion tool then creates an index over the store with the chunk nodes (i've been struggling to access this index later -- should i be manually setting a new index with add_index_struct and reading it back later by rebuilding an index from the index struct?)
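
for reference, step 1 currently looks roughly like this (simplified sketch; the path is just a placeholder):

Plain Text
from llama_index import SimpleDirectoryReader, StorageContext

# the shared context that gets passed to every tool
storage_context = StorageContext.from_defaults()

# step 1: read files, then manually push the document nodes into the shared docstore
# (SimpleDirectoryReader doesn't take a storage_context arg)
documents = SimpleDirectoryReader("./repo").load_data()
storage_context.docstore.add_documents(documents)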
hmm, so the ingestion pipeline is only storing the hashes and (optionally, turned on by default) the original full documents in the docstore. This is mostly for upsert/deduplication abilities.

Most vector store integrations store the nodes in the vector store itself.

Tbh, you might not need a storage context at all, depending on the vector store you are using?
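
e.g. a minimal sketch of that docstore behaviour (assuming the legacy llama_index imports used elsewhere in this thread; details vary by version):

Plain Text
from llama_index import SimpleDirectoryReader
from llama_index.ingestion import IngestionPipeline
from llama_index.node_parser import SentenceSplitter
from llama_index.storage.docstore import SimpleDocumentStore

documents = SimpleDirectoryReader("./data").load_data()

# with a docstore attached, the pipeline records document hashes (and, by default,
# the original documents) so re-runs can skip or upsert unchanged inputs
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter()],
    docstore=SimpleDocumentStore(),
)
nodes = pipeline.run(documents=documents)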
ah so use one vector store for all ingested nodes
loading: self._storage_context.docstore.add_documents(nodes)

querying: nodes = list(self._storage_context.docstore.docs.values())

is what i'm doing now, and i'm building a new index whenever i query
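
i.e. roughly this pattern (sketch only; storage_context / service_context here stand in for the shared ones each tool receives):

Plain Text
from llama_index import VectorStoreIndex

# loading (ingestion tool): push nodes into the shared docstore
storage_context.docstore.add_documents(nodes)

# querying (query tool): pull everything back out and rebuild the index each time
all_nodes = list(storage_context.docstore.docs.values())
index = VectorStoreIndex(
    all_nodes,
    storage_context=storage_context,
    service_context=service_context,
)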
oh interesting πŸ‘€ Are you using the default vector store then? Or something like qdrant, weaviate, etc?
Plain Text
        LoadCodeTool(leader_storage_context, leader_service_context),
        QueryCodeTool(leader_storage_context, leader_service_context),
nice -- so technically, you don't need to rebuild your index from nodes every time
Since the nodes are in chroma, and chroma is typically persisted automatically (from my understanding), you can set up the vector store object to point to an existing vector store and do something like

index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
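
e.g. with chroma that would look something like this (sketch; the path and collection name are made up, and service_context is whatever you already have):

Plain Text
import chromadb
from llama_index import VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# re-attach to the already-persisted chroma collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("code_nodes")
vector_store = ChromaVectorStore(chroma_collection=collection)

# no re-ingestion needed -- the index wraps whatever is already stored
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)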
the docs will need to be in the vector store before running that, right (i.e. add the vector_store arg to the ingestion)?
right πŸ‘ So the flow might be

Plain Text
pipeline = IngestionPipeline(..., vector_store=vector_store)
pipeline.run(documents=documents)
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
will that vector_store only upsert the nodes during ingestion if they have embeddings? my understanding was that until i call an XIndex wrapper, there won't be embeddings on the nodes unless i add it to the transformations
possibly related (or indicative that i'm off track): is there a generic Embed transformer that i can add to the end of the ingestion pipeline to attach node embeddings?

the docs are clear on how i would write that, curious if there's a native one
it would have to have a generically applied metadata embed strategy, so maybe a blanket Embed transform like that wouldn't work well across nodes with different content+metadata embed strategies
Right, the pipeline needs to have embeddings in the transformations.

For example

Plain Text
pipeline = IngestionPipeline(transformations=[..., OpenAIEmbedding()])


Every embedding model class extends the base transform component class
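
so end-to-end, the flow from this thread would look roughly like this (a sketch assuming legacy llama_index imports and chroma; the path and collection name are placeholders):

Plain Text
import chromadb
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.ingestion import IngestionPipeline
from llama_index.node_parser import SentenceSplitter
from llama_index.vector_stores import ChromaVectorStore

documents = SimpleDirectoryReader("./data").load_data()

collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("code_nodes")
vector_store = ChromaVectorStore(chroma_collection=collection)
service_context = ServiceContext.from_defaults(embed_model=OpenAIEmbedding())

# the embedding model at the end of the transformations means nodes land in the
# vector store with embeddings already attached
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), OpenAIEmbedding()],
    vector_store=vector_store,
)
pipeline.run(documents=documents)

# later (e.g. in the query tool), rebuild the index straight from the vector store
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)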
oh i can apply it just like that!