afternoon builders :salutecoffee:

i'm working to have a shared storage space amongst tools, so i'm passing the same instance of a StorageContext. i'd like to persist (to the shared StorageContext) the original document nodes, the chunked/metadata-enhanced nodes, and indexes [over the node stores] in the ingestion tool, then read them elsewhere. i'm trying to understand how the different storage objects in the context are managed

  1. ingestion tool reads files to get document nodes (manually adding these docs to the docstore after getting the document nodes, since there's no storage_context arg to SimpleDirectoryReader -- rough sketch of this step below)
  2. ingestion tool then chunks them (i'm passing a vector_store here, but should i also update the docstore? will vector_stores[vector_store.name] automatically add the chunks/nodes from the transformation? will it only do that if one of the transformation steps adds embeddings to each node?)
  3. the ingestion tool then creates an index over the store with the chunk nodes (i've been struggling to access this index later -- should i be manually setting a new index with add_index_struct and reading it back later by rebuilding an index from the index struct?)
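
for reference, step 1 currently looks roughly like this (simplified sketch; the path is just a placeholder):

Plain Text
from llama_index import SimpleDirectoryReader, StorageContext

# the shared context that gets passed to every tool
storage_context = StorageContext.from_defaults()

# step 1: read files, then manually push the document nodes into the shared docstore
# (SimpleDirectoryReader doesn't take a storage_context arg)
documents = SimpleDirectoryReader("./repo").load_data()
storage_context.docstore.add_documents(documents)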
hmm, so the ingestion pipeline is only storing the hashes and (optionally, turned on by default) the original full documents in the docstore. This is mostly for upsert/deduplication abilities.

Most vector store integrations store the nodes in the vector store itself.

Tbh, you might not need a storage context at all, depending on the vector store you are using?
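
e.g. a minimal sketch of that docstore behaviour (assuming the legacy llama_index imports used elsewhere in this thread; details vary by version):

Plain Text
from llama_index import SimpleDirectoryReader
from llama_index.ingestion import IngestionPipeline
from llama_index.node_parser import SentenceSplitter
from llama_index.storage.docstore import SimpleDocumentStore

documents = SimpleDirectoryReader("./data").load_data()

# with a docstore attached, the pipeline records document hashes (and, by default,
# the original documents) so re-runs can skip or upsert unchanged inputs
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter()],
    docstore=SimpleDocumentStore(),
)
nodes = pipeline.run(documents=documents)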
ah so use one vector store for all ingested nodes
loading: self._storage_context.docstore.add_documents(nodes)

querying: nodes = list(self._storage_context.docstore.docs.values())

is what i'm doing now, and i'm building a new index whenever i query
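
i.e. roughly this pattern (sketch only; storage_context / service_context here stand in for the shared ones each tool receives):

Plain Text
from llama_index import VectorStoreIndex

# loading (ingestion tool): push nodes into the shared docstore
storage_context.docstore.add_documents(nodes)

# querying (query tool): pull everything back out and rebuild the index each time
all_nodes = list(storage_context.docstore.docs.values())
index = VectorStoreIndex(
    all_nodes,
    storage_context=storage_context,
    service_context=service_context,
)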
oh interesting πŸ‘€ Are you using the default vector store then? Or something like qdrant, weaviate, etc?
Plain Text
        LoadCodeTool(leader_storage_context, leader_service_context),
        QueryCodeTool(leader_storage_context, leader_service_context),
nice -- so technically, you don't need to rebuild your index from nodes every time
Since the nodes are in chroma, and chroma is typically persisted automatically (from my understanding), you can set up the vector store object to point to an existing vector store and do something like

index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
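
e.g. with chroma that would look something like this (sketch; the path and collection name are made up, and service_context is whatever you already have):

Plain Text
import chromadb
from llama_index import VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# re-attach to the already-persisted chroma collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("code_nodes")
vector_store = ChromaVectorStore(chroma_collection=collection)

# no re-ingestion needed -- the index wraps whatever is already stored
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)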
the docs will need to be in the vector store before running that, right (i.e. add the vector_store arg to the ingestion)?
right πŸ‘ So the flow might be

Plain Text
pipeline = IngestionPipeline(..., vector_store=vector_store)
pipeline.run(documents=documents)
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)
will that vector_store only upsert the nodes during ingestion if they have embeddings? my understanding was that until i call an XIndex wrapper, there won't be embeddings on the nodes unless i add it to the transformations
possibly related (or indicative that i'm off track): is there a generic Embed transformer that i can add to the end of the ingestion pipeline to attach node embeddings?

the docs are clear on how i would write that, curious if there's a native one
it would have to have a generically applied metadata embed strategy, so maybe a blanket Embed transform like that wouldn't work well across nodes with different content+metadata embed strategies
Right, the pipeline needs to have embeddings in the transformations.

For example

Plain Text
pipeline = IngestionPipeline(transformations=[..., OpenAIEmbedding()])


Every embedding model class extends the base transform component class
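
so end-to-end, the flow from this thread would look roughly like this (a sketch assuming legacy llama_index imports and chroma; the path and collection name are placeholders):

Plain Text
import chromadb
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import OpenAIEmbedding
from llama_index.ingestion import IngestionPipeline
from llama_index.node_parser import SentenceSplitter
from llama_index.vector_stores import ChromaVectorStore

documents = SimpleDirectoryReader("./data").load_data()

collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("code_nodes")
vector_store = ChromaVectorStore(chroma_collection=collection)
service_context = ServiceContext.from_defaults(embed_model=OpenAIEmbedding())

# the embedding model at the end of the transformations means nodes land in the
# vector store with embeddings already attached
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(), OpenAIEmbedding()],
    vector_store=vector_store,
)
pipeline.run(documents=documents)

# later (e.g. in the query tool), rebuild the index straight from the vector store
index = VectorStoreIndex.from_vector_store(vector_store, service_context=service_context)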
oh i can apply it just like that!