Why does the RAG pipeline presented in

At a glance

Why does the RAG pipeline presented in most documentation examples seems wrong?

I've been studying these various frameworks like llama-index and langchain recently. I've noticed that all the examples they give are like this:

read a document from a folder/source -> create a "vector store" based on the documents -> querying the llm using the context of the created vector store

In most cases, we have to generate embeddings of the documents (or nodes) when ingesting them inside the vector store. Isn't this approach kind of not optimal? Because when the application is off, I'd have to generate the embeddings of the same documents over and over again, generating waste of resources.

Note that I didn't send any piece of code here, I'm talking about the general workflow.

I've found a way to do the ingestion phase and the QA phase separately in langchain. I'm now migrating this method to Llamaindex, which I think will be more optimal because I really liked what llamaparse does for better parsing documents. The ingestion is done separately from the QA and all of them are asynchronous processes.

Is this process presented by most documentations really wrong or am I missing something? I'm still studying the documentations but slowly because this is not my priority right now.

22 comments

LLogan M

Because when the application is off, I'd have to generate the embeddings of the same documents over and over again, -- why?

You can generate embeddings for your text chunks, save them in some vector db, and off you go

LLogan M

curious what the confusion is 😅

LLogan M

Once its saved in the vector db, you can just use that same db again without recalculating anything

Find answers from the community

Why does the RAG pipeline presented in