Hi,

At a glance

The community member has built two simple RAG (Retrieval-Augmented Generation) scripts, one using Langchain and the other using Llamaindex. They have indexed documents in a Chroma database using Langchain and the same embedding function. When evaluating the faithfulness of the responses, the community member found that the Langchain retriever had 80% faithfulness, while the Llamaindex retriever had only 20% faithfulness.

The community member suspects that the issue may be due to the document structure in Chroma, and they are trying to reindex everything using Llamaindex, but the corpus is large and it will take a day or two to have a replica database.

In the comments, other community members suggest that the community member should investigate the retrieved nodes and the responses to see if they make sense, as well as check the settings to ensure they are similar to Langchain. The community member later finds that the issue was caused by passing the custom prompt directly into the standard query engine query string, which confused the language model.

BBaygon

Hi,
I've built 2 simple RAG scripts, one in Langchain and one in Llamaindex:

Llamaindex: query_engine = index.as_query_engine()
Langchain:

    chain = load_qa_chain(llm, chain_type="stuff")
    res = chain.run(input_documents=docs, question=prompt)

Then I have a chromadb where the docs have been indexed via langchain with the same embedding function.
Finally, in another script, I pass 100 questions, store the retrieved context and the responses, and use a custom prompt to evaluate the faithfulness of each response given the question and context.
I got a very disturbing result: 80% faithfulness when using the Langchain retriever, but only 20% when using Llamaindex.
I assume it could be because of the document structure in chroma, so I'm trying to reindex everything, but the corpus is big and I need to wait a day or 2 before having a replica db indexed with llamaindex.
Has anyone experienced the same, and could you point me in the right direction to get proper faithfulness from Llamaindex? I'm trying to migrate away from Langchain, but these results do not help.
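
For readers who want to reproduce the comparison, here is a minimal sketch of how the per-question verdicts from such an evaluation could be turned into a faithfulness rate. The `EvalRecord` class and both helper functions are hypothetical names made up for illustration; the thread does not show the actual evaluation script or judge prompt.

```python
from dataclasses import dataclass


@dataclass
class EvalRecord:
    """One evaluated question (hypothetical structure, not from the thread)."""
    question: str
    context: str
    response: str
    faithful: bool  # verdict returned by the custom judge prompt


def faithfulness_rate(records):
    """Fraction of records whose response was judged faithful to its context."""
    if not records:
        raise ValueError("no records to score")
    return sum(r.faithful for r in records) / len(records)


def unfaithful_questions(records):
    """Questions worth inspecting by hand first."""
    return [r.question for r in records if not r.faithful]
```

Keeping the retrieved context next to each verdict makes it easier to tell later whether a low score comes from retrieval or from generation.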

7 comments

LLogan M

I mean, I would probably dive into a handful of questions that aren't performing well

response = query_engine.query("...")

Here, you can check response.source_nodes to see if the retrieved nodes make sense

I'm not sure about langchain, but with llama-index, the default top-k is 2. And of course there are a few other things you can do to tweak the performance, but checking
a) do my retrieved nodes make sense?
b) does the response make sense for the given nodes?

will help somewhat to track down the issue

You can also just create a retriever to debug retrieval, if that is the issue

retriever = index.as_retriever(similarity_top_k=2)
nodes = retriever.retrieve("test")
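
To make check (a) easier to eyeball, a small helper along these lines could print one line per retrieved node. This is only a sketch: it assumes each retrieved item exposes `.score` and `.text`, which is my understanding of llama-index's `NodeWithScore` but should be verified against your installed version. `FakeNodeWithScore` is a hypothetical stand-in so the snippet runs without an index.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeNodeWithScore:
    """Stand-in for a retrieved node (assumed attributes: .text and .score)."""
    text: str
    score: Optional[float]


def format_retrieved(nodes, preview_chars=80):
    """Return one summary line per node: rank, similarity score, text preview."""
    lines = []
    for rank, node in enumerate(nodes, start=1):
        score = "n/a" if node.score is None else f"{node.score:.3f}"
        # Collapse whitespace so multi-line chunks fit on one line.
        preview = " ".join(node.text.split())[:preview_chars]
        lines.append(f"{rank}. score={score} {preview}")
    return lines


for line in format_retrieved([FakeNodeWithScore("Bank of America\n reported", 0.81234)]):
    print(line)
```

With a real index, the same function could presumably be called on the output of `retriever.retrieve(...)` or on `response.source_nodes`.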

LLogan M

I would suspect that maybe the retrieved nodes are lacking some information? Or if not, there is probably a way to make the settings more similar to langchain.

What LLM are you using?

BBaygon

GPT4

We have logged all steps, and on the 115 different questions that we asked through Llamaindex, it always fetched the same context elements. When asking about Apple or Alphabet or any other stock, it returns the same chunk from Bank of America, which is really weird.
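
A symptom like this can be caught automatically from the logs. The sketch below assumes the logged data can be reduced to a mapping from question to the ids of the chunks retrieved for it; the function name, the mapping shape and the chunk ids in the usage example are all hypothetical.

```python
from collections import Counter


def retrieval_diversity(retrieved_ids_by_question):
    """Summarise how varied the retrieved chunks are across questions.

    Returns (number of distinct chunk sets, most common chunk set,
    share of questions that received that most common set).
    """
    if not retrieved_ids_by_question:
        raise ValueError("no retrieval logs given")
    # Sort ids so the same chunks in a different order count as the same set.
    signatures = Counter(
        tuple(sorted(ids)) for ids in retrieved_ids_by_question.values()
    )
    signature, count = signatures.most_common(1)[0]
    return len(signatures), signature, count / len(retrieved_ids_by_question)


logs = {
    "Apple revenue?": ["boa-12", "boa-13"],
    "Alphabet ad growth?": ["boa-13", "boa-12"],
}
print(retrieval_diversity(logs))
```

A share close to 1.0 across many unrelated questions suggests the query text reaching the retriever barely changes from one question to the next.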

LLogan M

yea that sounds pretty strange.... I would suspect that might have something to do with the vector db being created without llama-index. But hard to say without getting my hands on it.

LLogan M

Generally I've never had issues like that working with llamaindex alone

BBaygon

oh lol, we just found the issue.
We passed the custom prompt directly into the standard query engine query string, which confuses the LLM 😂
We were so focused on looking under the hood that we didn't check the basics haha
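
One plausible reading of why this produced the same chunk for every question (not confirmed in the thread) is that the query string is also what gets matched against the index, so the prompt boilerplate drowns out the actual question. The toy below illustrates the effect with plain word overlap instead of real embeddings; the chunks, the prompt text and the questions are all invented for the example.

```python
import re

# Toy corpus: three hypothetical chunks standing in for the indexed documents.
CHUNKS = {
    "apple": "Apple reported record iPhone revenue this quarter.",
    "alphabet": "Alphabet grew search advertising revenue this quarter.",
    "boa": (
        "Bank of America disclaimer: this report is provided as context only "
        "and does not answer any financial question."
    ),
}

# Hypothetical custom prompt; the real one is not shown in the thread.
CUSTOM_PROMPT = (
    "You are a financial analyst. Use only the provided context to answer "
    "the question. Question: {question}"
)

QUESTIONS = [
    "What was Apple revenue this quarter?",
    "How did Alphabet search advertising grow?",
]


def words(text):
    """Lowercased set of alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query):
    """Return the id of the chunk sharing the most distinct words with the query."""
    query_words = words(query)
    return max(CHUNKS, key=lambda cid: len(query_words & words(CHUNKS[cid])))


# Whole prompt used as the query: the boilerplate dominates the match.
with_prompt = [retrieve(CUSTOM_PROMPT.format(question=q)) for q in QUESTIONS]
# Bare question used as the query: each question finds its own chunk.
question_only = [retrieve(q) for q in QUESTIONS]

print(with_prompt)    # ['boa', 'boa']
print(question_only)  # ['apple', 'alphabet']
```

In llama-index, custom instructions normally belong in the query engine's prompt template (as far as I know via the `text_qa_template` argument of `as_query_engine`), with only the bare question passed to `query()`; check the docs for your version.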

LLogan M

Oh good find! 👀👍
