How big is your index? Did you check response.source_nodes to see if the proper text was retrieved? Did you play around with the top k or chunk size settings?
For bigger indexes, some more complex retrieval methods might be needed.
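For anyone following along, checking what was retrieved could look roughly like the sketch below. It assumes each entry in response.source_nodes exposes a .score and a .node with a get_content() method; summarize_sources is a made-up helper name, not part of the library, and the stand-in objects at the bottom exist only so the snippet runs without an index.

```python
from types import SimpleNamespace


def summarize_sources(response, max_chars=80):
    """Return (score, text preview) pairs for each retrieved node.

    Assumes each item in response.source_nodes has `.score` and
    `.node.get_content()`; adjust if your version differs.
    """
    rows = []
    for item in response.source_nodes:
        text = item.node.get_content()
        rows.append((item.score, text[:max_chars]))
    return rows


# Stand-in response so the sketch runs without building an index.
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.83,
            node=SimpleNamespace(get_content=lambda: "Re: Q3 budget, numbers attached"),
        ),
    ]
)

for score, preview in summarize_sources(fake_response):
    print(score, preview)
```

If the previews have nothing to do with the query, the problem is in retrieval (chunking, embeddings, top k) rather than in the LLM's answer.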
hi Logan - thanks for your reply! I didn't check response.source_nodes - I will do that... right now this isn't working even w/small subsets of data so I am guessing there is a problem w/my index... will circle back after I check response.source_nodes
ok that was illuminating... it's pulling back the wrong context.
I am not sure where to go from here. I am currently using the simplest setup possible. I've scaled back from the Sub Question Query Engine & Chatbot just bc those were also not finding expected context, so I stripped everything back to the most simple code and reduced document volume to try to troubleshoot. Now I can clearly see it's pulling incorrect context, even when I search with a unique name in the query. Any suggestions on where to go from here? Should I re-index/re-embed? and if so, how... all I did was use the standard VectorStoreIndex.from_documents -- guessing that was my error?
Can you give some examples of queries you are trying, that aren't working?
Certain "categories" of queries might require some more config, beyond the basic vector index
I just made a change to my index setup on the very small test data set... I added chunk_overlap=100, and now the query works
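A rough illustration of why overlap can matter: without it, a name and the sentence about it can land in different chunks. This is a toy word-based splitter, not the library's own (which, as far as I know, counts tokens rather than words); chunk_words is a hypothetical helper name.

```python
def chunk_words(text, chunk_size, chunk_overlap):
    """Split text into chunks of `chunk_size` words, where each chunk
    repeats the last `chunk_overlap` words of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


text = "a b c d e f g h"
print(chunk_words(text, chunk_size=4, chunk_overlap=0))
print(chunk_words(text, chunk_size=4, chunk_overlap=2))
```

With overlap 0 the pair "d e" is never in the same chunk; with overlap 2 the middle chunk contains both, so a query that needs both words has a chunk that can match.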
Interesting, that works! I was going to suggest smaller chunk size or hybrid search lol
thanks - I am not sure that I know what you mean by Hybrid Search, but... I am going to re-embed the full data set with this overlap param and try again in the chatbot w/Sub Question Query Engine
maybe too ambitious, but I will report back either way. thank you so much for being on the other end of this discord chat!
And yea, happy to help. Good luck!
thanks, I will take a look!
I said I would report back: so far, not much luck, but I have not tried the hybrid search yet. I am starting to think that doing RAG over emails is really hard and maybe I should start with some other set of data/docs. Either that, or I am doing something wrong when creating the embeddings.
Hmm, I guess emails are a little tricky.
I have a feeling hybrid will help quite a bit.
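In case the term is unfamiliar: hybrid search blends embedding similarity with plain keyword matching, which tends to help when the query contains a unique name. Below is a minimal sketch of the idea only, assuming a simple weighted sum; the function names, the alpha weight, and the toy 2-d vectors are all made up for illustration, and real vector stores implement this differently.

```python
import math


def keyword_score(query, text):
    """Fraction of query words that appear verbatim in the text."""
    q_terms = set(query.lower().split())
    t_terms = set(text.lower().split())
    if not q_terms:
        return 0.0
    return len(q_terms & t_terms) / len(q_terms)


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """Rank (text, vector) pairs by alpha * vector score
    + (1 - alpha) * keyword score, best first."""
    scored = []
    for text, vec in docs:
        score = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text)
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Toy data: the second doc is "closer" in embedding space,
# but only the first one mentions the unique name.
docs = [
    ("invoice from Zelenko about renewal", [0.2, 0.9]),
    ("general note about contract renewal terms", [0.9, 0.3]),
]
query, query_vec = "email from Zelenko", [1.0, 0.2]

print(hybrid_rank(query, query_vec, docs, alpha=1.0)[0][1])  # vector only
print(hybrid_rank(query, query_vec, docs, alpha=0.5)[0][1])  # blended
```

With alpha=1.0 (vectors only) the generic note ranks first; with alpha=0.5 the exact name match pulls the Zelenko email to the top.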
When you create your documents before embedding, is it one document per email? Or does it include an entire chain of emails per document?
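If it helps, here is one way the "one document per email" option might look, using only the Python standard library. It assumes the emails are available as raw RFC 822 strings; emails_to_documents and the dict layout are hypothetical, and you would still need to wrap each result in whatever document type your index expects.

```python
from email import message_from_string


def emails_to_documents(raw_emails):
    """Turn each raw email into one {"text", "metadata"} dict so that
    sender, subject, and date travel with the body text."""
    documents = []
    for raw in raw_emails:
        msg = message_from_string(raw)
        # Collect plain-text parts; a non-multipart message yields itself.
        # Note: payloads with a transfer encoding (e.g. base64) are not decoded here.
        parts = [
            part.get_payload()
            for part in msg.walk()
            if not part.is_multipart() and part.get_content_type() == "text/plain"
        ]
        documents.append(
            {
                "text": "\n".join(parts).strip(),
                "metadata": {
                    "from": msg["From"],
                    "subject": msg["Subject"],
                    "date": msg["Date"],
                },
            }
        )
    return documents


raw_emails = [
    "From: alice@example.com\nSubject: Q3 budget\n\nHi Bob,\nnumbers attached.\n",
    "From: bob@example.com\nSubject: Re: Q3 budget\n\nThanks, will review.\n",
]

for doc in emails_to_documents(raw_emails):
    print(doc["metadata"]["subject"], "|", doc["text"])
```

Keeping one email per document means a retrieved chunk points back to a single sender and subject, whereas a whole chain in one document can mix several people's text in the same chunk.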