
I'm having an issue getting the ingestion pipeline to work with Weaviate + a Redis cache (so that I can ingest new documents later).

I had to add index = VectorStoreIndex(nodes, storage_context=storage_context) to get it to load into Weaviate, but now it seems the cache is not taken into account. Without the index line it seemed like it was processing data but never loading.

When using Chroma I was able to get the ingestion pipeline + cache to work (everything the same minus the index = line).

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SimpleNodeParser(),
#        SentenceSplitter(chunk_size=512, chunk_overlap=20),
#        TitleExtractor(nodes=5),
#        SummaryExtractor(summaries=["prev", "self"]),
#        KeywordExtractor(keywords=10),
#        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
    cache=ingest_cache,
)

nodes = pipeline.run(documents=documents, storage_context=storage_context)

index = VectorStoreIndex(nodes, storage_context=storage_context)
I think the issue is you've commented out the embeddings

Plain Text
pipeline = IngestionPipeline(
    transformations=[
        SimpleNodeParser(),
        OpenAIEmbedding(),
    ],
    vector_store=vector_store,
    cache=ingest_cache,
)

nodes = pipeline.run(documents=documents)


This should work
Thanks Logan. I think I tried this, but I will try exactly your recommendation to confirm.

I was confused about whether to let LlamaIndex do the embeddings or let Weaviate do the embeddings, and whether one was preferred over the other.

I think I recall reading that if Weaviate receives embeddings, it is smart enough to not "re-embed".
Then you can create your index with

index = VectorStoreIndex.from_vector_store(vector_store)
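Putting that together, here is a minimal sketch of building an index (and from it a query engine) over the already-populated Weaviate store, using the pre-0.10 llama_index import layout this thread is based on. The client URL and index name are assumptions for illustration; this is a configuration sketch, not a definitive setup:

```python
import weaviate
from llama_index import VectorStoreIndex
from llama_index.vector_stores import WeaviateVectorStore

# Assumed local Weaviate instance and index name -- adjust for your setup.
client = weaviate.Client("http://localhost:8080")
vector_store = WeaviateVectorStore(weaviate_client=client, index_name="Documents")

# Wrap the existing vector store; nothing is re-ingested or re-embedded here.
index = VectorStoreIndex.from_vector_store(vector_store)

# The index gives easy access to query engines, chat engines, etc.
query_engine = index.as_query_engine()
response = query_engine.query("What are these documents about?")
print(response)
```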
LlamaIndex handles the embeddings, for all vector stores
All the vector stores assume you have embeddings attached when calling vector_store.add() under the hood
But if I'm just loading the data into the vector store at this point, then I don't have to use VectorStoreIndex, correct?

Only when I go to query?
with Chroma it was loading with just pipeline.run
Yea you are right, it will work the same as Chroma, assuming your pipeline has embeddings
I just put the index thing above since it gives easy access to a bunch of stuff like query engines, chat engines, etc.
Understood. Thanks for the help. I'll try again and reconfirm.

Is my assumption correct that Weaviate will not try to embed on its side if it receives embeddings?
Yea that's right. Specifically, the code for inserting into Weaviate is here. It (probably) won't work if there are no embeddings

https://github.com/run-llama/llama_index/blob/50b9f75461ec7a5baa625126d071f76ea5dc5a5d/llama_index/vector_stores/weaviate_utils.py#L141
okay. Ill try your recommendations.
I can confirm it does work; the code up above is what I am using and it works.

Just not the caching part.

The Weaviate docker is set up with
ENABLE_MODULES: 'text2vec-openai,reranker-transformers,generative-openai'
Appreciate your help Logan
How are you expecting the caching to work?

How it DOES work is that it caches a hash of each transform step + its input nodes. If there is a cache hit, that transform step is skipped and the cached results are used.

It will still fully run the pipeline, inserting nodes into your vector db
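That cache-hit behavior can be illustrated in plain Python (hypothetical names, not LlamaIndex's actual internals): each transform's result is stored under a hash of the transform's identity plus its input, so re-running the same step on the same input skips the work, while the rest of the pipeline still proceeds.

```python
import hashlib
import json

cache = {}  # stands in for the Redis-backed cache


def run_transform(name, fn, nodes):
    # Key the cache on the transform's identity plus its input nodes.
    key = hashlib.sha256(json.dumps([name, nodes]).encode()).hexdigest()
    if key in cache:
        return cache[key]   # cache hit: skip the work, reuse cached results
    result = fn(nodes)      # cache miss: actually run the transform
    cache[key] = result
    return result


split = lambda nodes: [word for n in nodes for word in n.split()]

first = run_transform("split", split, ["hello world"])
second = run_transform("split", split, ["hello world"])  # hit, no recompute
print(first == second)  # True
```

Note that a cache hit only avoids recomputing a transform; it does not stop nodes from being inserted into the vector store again, which is why deduplication needs the docstore instead.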
If you are trying to avoid duplicate inserts, look into attaching a docstore
https://docs.llamaindex.ai/en/stable/examples/ingestion/document_management_pipeline.html
That makes sense . I'm still learning
yea no worries πŸ™‚
Hey Logan - Quick Q after reading about document stores.

So if I go this route, is it recommended to just drop the RedisCache method? Would there be any purpose for it if I was just using it to ensure duplicate documents were not loaded?
ah.. I think this is an example I should follow
https://docs.llamaindex.ai/en/stable/examples/ingestion/redis_ingestion_pipeline.html

Docstore + cache in Redis, and Weaviate as the vector store.
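Following that linked example, a hedged sketch of combining a Redis docstore (to skip duplicate documents), a Redis-backed ingestion cache (to skip repeated transform steps), and Weaviate as the vector store, again under the pre-0.10 llama_index layout. Hostnames, ports, namespaces, and the index name are assumptions; treat this as a configuration sketch requiring running Redis and Weaviate instances:

```python
import weaviate
from llama_index.ingestion import IngestionPipeline, IngestionCache
from llama_index.ingestion.cache import RedisCache
from llama_index.storage.docstore import RedisDocumentStore
from llama_index.node_parser import SentenceSplitter
from llama_index.embeddings import OpenAIEmbedding
from llama_index.vector_stores import WeaviateVectorStore

client = weaviate.Client("http://localhost:8080")  # assumed local Weaviate
vector_store = WeaviateVectorStore(weaviate_client=client, index_name="Documents")

pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=20),
        OpenAIEmbedding(),  # embeddings stay in the pipeline, per above
    ],
    vector_store=vector_store,
    # Docstore: deduplicates, so already-ingested documents are skipped.
    docstore=RedisDocumentStore.from_host_and_port(
        "localhost", 6379, namespace="document_store"
    ),
    # Cache: skips re-running transform steps whose inputs haven't changed.
    cache=IngestionCache(cache=RedisCache.from_host_and_port("localhost", 6379)),
)

nodes = pipeline.run(documents=documents)  # documents loaded elsewhere
```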
Yea the cache and docstore have slightly different use cases. But sounds like you got it πŸ™‚