Hi everyone! I've recently been diving into AI/RAG, particularly using LlamaIndex (love the project, amazing work!) and open-source models. I have two questions that might come out of my ignorance, but it would be great if anyone could answer:
  1. Why do we retrieve the top-k most similar chunks instead of setting a similarity threshold?
  2. Is there a "recommended" maximum size/amount of documents for ingestion? As in, after a certain amount the model might not perform as well as expected?
Thanks!
2 comments
Setting a good similarity threshold can be difficult, and it can lead to no results from your knowledge base populating your prompt at all. It's typically just better to fetch the most similar results (with some exceptions), especially because models like GPT-4 are pretty good at recognizing that a retrieved chunk is irrelevant and not using it to formulate a response. Also, sometimes a text chunk might be mostly irrelevant but contain one small piece of data you'd still like the model to have.
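
For reference, here's a minimal sketch of plain top-k retrieval with LlamaIndex (assuming the newer `llama_index.core` imports, the default OpenAI LLM/embeddings, and a `./data` folder of documents; older versions import from `llama_index` directly):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load documents and build an in-memory vector index
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Always retrieve the 3 most similar chunks per query,
# regardless of their absolute similarity scores
query_engine = index.as_query_engine(similarity_top_k=3)
print(query_engine.query("Your question here"))
```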

The recommended size will differ quite a lot based on how you're structuring your pipeline, but in my experience it takes a very large amount of data to notice a reduction in performance, and those effects can typically be mitigated. Semantic search holds up pretty well even with a lot of data, though you might see some pollution if you have a massive amount of data that is semantically similar. Those cases are generally a good place to implement similarity cutoffs, as in the sketch below.
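
A sketch of layering a cutoff on top of top-k retrieval (the `SimilarityPostprocessor` usage follows recent LlamaIndex versions; the `0.75` threshold is just an illustrative value to tune for your embedding model):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.postprocessor import SimilarityPostprocessor

# Same index setup as in the earlier sketch
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Fetch a generous top-k first, then drop any node scoring below
# the cutoff so weak matches never reach the prompt
query_engine = index.as_query_engine(
    similarity_top_k=5,
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.75)],
)
print(query_engine.query("Your question here"))
```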
Thanks a lot @Teemu, very informative