I'm currently deploying Celery as ETL

I'm currently deploying Celery as an ETL management system. What do you think about using ingestion pipeline workers?
I think that makes sense! I did another project where I just had workers running in EKS that pulled work from RabbitMQ to process, so it makes a lot of sense.
As to your other point: assuming your loaded documents have a consistent doc_id, you can attach a docstore + vector store to the ingestion pipeline (Redis, MongoDB, Firestore, Postgres) and then manage upserts that way.
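(A minimal sketch of that combination, assuming a Celery worker fed from RabbitMQ as described above, Redis as the docstore, and a Chroma server as the vector store; the broker URL, hostnames, collection/namespace names, and embedding model below are placeholders, and exact constructor arguments can vary between LlamaIndex versions.)

```python
import chromadb
from celery import Celery
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import DocstoreStrategy, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.storage.docstore.redis import RedisDocumentStore
from llama_index.vector_stores.chroma import ChromaVectorStore

app = Celery("etl", broker="amqp://guest@rabbitmq//")


def build_pipeline() -> IngestionPipeline:
    # Remote docstore (Redis) tracks doc_id -> content hash across worker runs,
    # so re-ingesting an unchanged document is a no-op and a changed one is upserted.
    docstore = RedisDocumentStore.from_host_and_port(
        host="localhost", port=6379, namespace="etl_docstore"
    )
    chroma = chromadb.HttpClient(host="localhost", port=8000)
    vector_store = ChromaVectorStore(
        chroma_collection=chroma.get_or_create_collection("etl_docs")
    )
    return IngestionPipeline(
        transformations=[
            SentenceSplitter(chunk_size=512, chunk_overlap=64),
            HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
        ],
        docstore=docstore,
        vector_store=vector_store,
        docstore_strategy=DocstoreStrategy.UPSERTS,
    )


@app.task(bind=True, max_retries=3)
def ingest_batch(self, file_paths: list[str]) -> int:
    """Celery worker task: load files and push them through the pipeline."""
    docs = SimpleDirectoryReader(input_files=file_paths).load_data()
    nodes = build_pipeline().run(documents=docs)
    return len(nodes)
```

With DocstoreStrategy.UPSERTS, the pipeline compares each incoming document's hash against what the docstore already holds for that doc_id, so unchanged documents are skipped and changed ones are re-embedded and replaced.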
Just browsing your repo! Amazing work, man. The only thing I really need to take care of is the pulling rate, to stay within concurrency limits (the HF inference server caps it at 512 concurrent requests). Do you have any feedback on this end?
Have you managed this issue?
hmm, my guess is to wrap requests with tenacity, using exponential backoff
could do this with a custom transformation that wraps your embeddings call in the pipeline
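(As a rough sketch of that idea: a custom TransformComponent that batches the embedding calls and retries them with exponential backoff via tenacity. The class name, batch size, and retry settings are illustrative, and `embed_model` is assumed to be whatever LlamaIndex embedding instance you already use.)

```python
from typing import Any, List

from tenacity import retry, stop_after_attempt, wait_exponential
from llama_index.core.schema import BaseNode, MetadataMode, TransformComponent


class RetryingEmbedTransform(TransformComponent):
    """Embed nodes in small batches, backing off when the server throttles."""

    embed_model: Any = None  # e.g. a HuggingFaceEmbedding instance
    batch_size: int = 64     # keep in-flight requests well under the 512 limit

    @retry(
        wait=wait_exponential(multiplier=1, min=2, max=60),
        stop=stop_after_attempt(6),
        reraise=True,
    )
    def _embed(self, texts: List[str]) -> List[List[float]]:
        # Retried with exponential backoff if the inference server errors out.
        return self.embed_model.get_text_embedding_batch(texts)

    def __call__(self, nodes: List[BaseNode], **kwargs: Any) -> List[BaseNode]:
        for start in range(0, len(nodes), self.batch_size):
            batch = nodes[start : start + self.batch_size]
            texts = [n.get_content(metadata_mode=MetadataMode.EMBED) for n in batch]
            for node, emb in zip(batch, self._embed(texts)):
                node.embedding = emb
        return nodes
```

You'd drop this into the pipeline's transformations list in place of the bare embedding model; capping batch_size (and num_workers when calling pipeline.run) keeps the total number of in-flight requests under the server's limit.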
Thanks again! One last question, I swear. If my documents are continuously added to the vector DB as they're processed in the ingestion pipeline, is there any way to also update/refresh the associated index on a live basis? This is needed to smoothly allow RAG over new documents too, but I'm not able to find anything like this in the LlamaIndex docs; it seems like indexing is kind of a static concept at the moment.
If you are using a remote vector store, it should be automatically synced
Really? VectorStoreIndexes connected to, for instance, ChromaDB are auto-synced? Wow!!
If it's a remote server, then yes 🙂 since the index is just an API connection.
It's a Docker container at the moment, so I think that counts as a remote server. Cool!
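(For reference, a sketch of querying on top of that setup, assuming the Dockerized Chroma is reachable on localhost:8000 and the collection name matches whatever the ingestion pipeline writes to; the embedding model should be the same one used at ingestion time.)

```python
import chromadb
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.chroma import ChromaVectorStore

client = chromadb.HttpClient(host="localhost", port=8000)
collection = client.get_or_create_collection("etl_docs")

# The index is just a view over the remote collection, so whatever the
# ingestion workers write becomes visible to new queries without a rebuild.
index = VectorStoreIndex.from_vector_store(
    ChromaVectorStore(chroma_collection=collection),
    embed_model=HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
)
query_engine = index.as_query_engine()
print(query_engine.query("What do the newest documents say?"))
```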