Hi,
I have an issue where my RAG code is not retrieving information from the document, even though the document is embedded. What could be the cause of this issue? I have tried many things, like changing the chunk size and changing the retriever's top_k. The code I am running is as follows:

# Import paths below assume the pre-0.10 llama_index layout; they moved in 0.10+
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings import LangchainEmbedding
from llama_index.llms import Gemini
from llama_index.query_pipeline import InputComponent, QueryPipeline
from llama_index.response_synthesizers import TreeSummarize

# Load every PDF in the data folder
pdfdocuments = SimpleDirectoryReader(r"C:\Users\Shaikh.Hammad\MS-Thesis\Data").load_data()

llm = Gemini()
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

# One shared service context for both the summarizer and the index
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
summarizer = TreeSummarize(service_context=service_context)

# Embed the documents and build a top-5 retriever over them
index = VectorStoreIndex.from_documents(pdfdocuments, service_context=service_context)
retriever = index.as_retriever(similarity_top_k=5)

# Pipeline: the input feeds the retriever, and the retrieved nodes plus the
# original query string feed the summarizer
p = QueryPipeline(verbose=True)
p.add_modules(
    {
        "input": InputComponent(),
        "retriever": retriever,
        "summarizer": summarizer,
    }
)
p.add_link("input", "retriever")
p.add_link("input", "summarizer", dest_key="query_str")
p.add_link("retriever", "summarizer", dest_key="nodes")

output = p.run(input="What is the PSL 2024 spends")

Output response: The provided context does not mention anything about PSL 2024 spends, so I cannot answer this question from the provided context.

There is a document named PSL 2024 Analysis, but the model is using PSL 2023 Analysis, which contains no information about 2024. Kindly help me with this issue: why is the model not using the 2024 document? Does it have to do with the embeddings?
15 comments
You can check response.source_nodes to see what nodes the LLM read to answer the question
Since you are using a query pipeline, I think it's output['response'].source_nodes? Maybe? Would have to check the dict keys
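Something like this, as an untested sketch (if p.run() hands back a dict instead of a Response, index into it with the module key first):

# Untested sketch: list the chunks the summarizer actually read.
# If the pipeline returns a dict, use output["response"].source_nodes instead.
output = p.run(input="What is the PSL 2024 spends")
for nws in output.source_nodes:
    print(nws.score, nws.node.metadata.get("file_name"))
    print(nws.node.get_content()[:200])  # first 200 characters of the chunk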
Ok, let me see
[NodeWithScore(node=TextNode(id_='6d1de722-53e2-4816-8a36-9426011fd7d3', embedding=None, metadata={'page_label': '2', 'file_name': 'PSL 2024 Analysis.pdf', 'file_path': 'C:\Users\Shaikh.Hammad\MS-Thesis\Data\PSL 2024 Analysis.pdf', 'file_type': 'application/pdf', 'file_size': 1443658, 'creation_date': '2024-04-16', 'last_modified_date': '2024-04-16', 'last_accessed_date': '2024-04-19'}

This is the response I am getting. It is picking the required document but not retrieving the information in it. Can you help me understand why it is showing embedding=None when I am applying the embedding model?
embedding=None is fine, it doesn't get attached to the returned nodes, mostly to save memory
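If you want to double-check that the embeddings were actually computed and stored, you can poke at the index directly. Rough sketch; _data is a private attribute of the default in-memory SimpleVectorStore, so this may differ across versions:

# Rough sketch: count what the index actually holds.
# _data is private to the default SimpleVectorStore and may change by version.
vector_store = index.storage_context.vector_store
print(len(vector_store._data.embedding_dict))  # one stored vector per chunk
print(len(index.docstore.docs))                # chunks tracked in the docstore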

I'm assuming you removed the actual text from the node, right?
This is the entire text.
Seems like a lot of the data is in tables -- have you looked into using something to format the tables better, like LlamaParse?
Table formatting really impacts the LLM's understanding of what it's reading
Can you provide a brief explanation of LlamaParse and how to use it in my pipeline?
LlamaParse is just another reader for files

https://github.com/run-llama/llama_parse

I recommend giving it a try
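Wiring it in is mostly a one-line swap of the reader. A sketch based on the llama_parse README (it needs a LLAMA_CLOUD_API_KEY, and result_type="markdown" is what preserves table structure):

# Sketch from the llama_parse README: use LlamaParse for the PDFs, then
# build the index exactly as before. Requires a LLAMA_CLOUD_API_KEY.
from llama_parse import LlamaParse

parser = LlamaParse(result_type="markdown")  # "markdown" keeps tables structured
pdfdocuments = SimpleDirectoryReader(
    r"C:\Users\Shaikh.Hammad\MS-Thesis\Data",
    file_extractor={".pdf": parser},  # route every .pdf through LlamaParse
).load_data()
index = VectorStoreIndex.from_documents(pdfdocuments, service_context=service_context)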
Thanks for the guidance
Hi Logan,
I have parsed the documents with llama_parse but am encountering an error when using this line of code:
"index = VectorStoreIndex.from_documents(pdfdocuments, service_context=service_context)".
The error message is: Unknown document type: <class 'llama_index.core.schema.Document'>.
Can you provide some help on this? Thanks.
I have imported:
from llama_index import (
    VectorStoreIndex,
    ServiceContext,
    SimpleDirectoryReader,
    load_index_from_storage,
    set_global_service_context,
)
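That error usually means two copies of the library are being mixed: llama_parse returns llama_index.core Documents (the 0.10+ package), while the legacy llama_index package expects its own Document class. Keeping every import on the core package avoids the mismatch. A sketch, assuming llama-index 0.10+:

# Sketch assuming llama-index 0.10+: import everything from llama_index.core
# so llama_parse's Documents and the index share the same Document class.
from llama_index.core import (
    VectorStoreIndex,
    ServiceContext,
    SimpleDirectoryReader,
    load_index_from_storage,
    set_global_service_context,
)
index = VectorStoreIndex.from_documents(pdfdocuments, service_context=service_context)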
I have resolved this error now. Thanks for the guidance.