I mean, I would probably dive into a handful of questions that aren't performing well and run them one at a time:
response = query_engine.query("...")
Here, you can check
response.source_nodes
to see if the retrieved nodes make sense
I'm not sure about langchain, but with llama-index, the default top-k is 2. And of course there are a few other things you can do to tweak the performance, but debugging in two steps
a) do my retrieved nodes make sense?
b) does the response make sense for the given nodes?
will help somewhat to track down the issue
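A rough sketch of that two-step check, in case it helps. The helper names (debug_query, print_report) are made up for illustration, and it assumes each entry in response.source_nodes exposes .score and .node.get_content(), which may differ depending on your llama-index version:

```python
def debug_query(query_engine, question, max_chars=200):
    """Run one question and collect what is needed to eyeball checks (a) and (b)."""
    response = query_engine.query(question)
    retrieved = []
    for item in response.source_nodes:
        # Assumption: each item has .score and .node.get_content();
        # adjust these attribute names if your version differs.
        text = item.node.get_content()
        retrieved.append((item.score, text[:max_chars]))
    # str(response) is assumed to give the final answer text.
    return str(response), retrieved


def print_report(question, answer, retrieved):
    """Print the retrieved snippets first (check a), then the answer (check b)."""
    print(f"Q: {question}")
    for rank, (score, snippet) in enumerate(retrieved, start=1):
        print(f"  [{rank}] score={score} {snippet!r}")
    print(f"A: {answer}")
```

You would call it as answer, retrieved = debug_query(query_engine, "one of the failing questions") and then print_report(...) with the results. If the snippets look wrong, the problem is retrieval; if they look right but the answer doesn't, the problem is on the response side.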
You can also just create a retriever to debug retrieval on its own, if that is the issue:
retriever = index.as_retriever(similarity_top_k=2)
nodes = retriever.retrieve("test")
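Since the top-k is small by default, one thing worth trying is running the same query at a few different top-k values and seeing whether the node you expected shows up at all. A minimal sketch, assuming the same as_retriever / retrieve calls as above and that the returned nodes expose .score and .node.get_content() (retrieve_at_k is just a hypothetical helper name):

```python
def retrieve_at_k(index, query, ks=(2, 5, 10), max_chars=200):
    """Retrieve for the same query at several top-k values and collect snippets."""
    results = {}
    for k in ks:
        retriever = index.as_retriever(similarity_top_k=k)
        nodes = retriever.retrieve(query)
        # Assumption: each returned node has .score and .node.get_content().
        results[k] = [(n.score, n.node.get_content()[:max_chars]) for n in nodes]
    return results
```

If the right chunk only appears at a higher k, bumping similarity_top_k is probably the fix; if it never appears, the issue is more likely in chunking or embeddings.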