This community is probably the experts on this: what methods are there for doing information retrieval on highly noisy sources like chat channels (Slack, Discord, etc.)? Due to the small amount of text, lack of context, and noise, embedding models don't seem to do well at all with these sources.
I think grouping has a lot to do with it

For example, my company slack uses threads consistently. So I would definitely be grouping by threads.

Otherwise, you can also group by time, and maybe create summaries of time-frames to perform retrieval over πŸ€”
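To make the grouping idea concrete, here is a minimal sketch of bundling Slack-style messages into retrieval units, by thread where one exists and by time window otherwise. The message shape (`ts`, `thread_ts`, `text`) and the 30-minute window are assumptions for illustration, not anything specified in this thread.

```python
from collections import defaultdict
from datetime import timedelta

def group_messages(messages, window=timedelta(minutes=30)):
    """Group chat messages into retrieval units: by thread when a
    thread_ts is present, otherwise into rolling time windows."""
    threads = defaultdict(list)
    loose = []
    for msg in messages:
        if msg.get("thread_ts"):
            threads[msg["thread_ts"]].append(msg)
        else:
            loose.append(msg)

    groups = [sorted(msgs, key=lambda m: m["ts"]) for msgs in threads.values()]

    # Bucket un-threaded messages into time windows anchored at the
    # first message of each bucket.
    loose.sort(key=lambda m: m["ts"])
    bucket = []
    for msg in loose:
        if bucket and msg["ts"] - bucket[0]["ts"] > window.total_seconds():
            groups.append(bucket)
            bucket = []
        bucket.append(msg)
    if bucket:
        groups.append(bucket)

    # Join each group into one document before embedding.
    return ["\n".join(m["text"] for m in group) for group in groups]
```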
Hm, so we do already group by threads, but even then it's too noisy.
Also, most embedding models tend to match queries more closely to short passages than to long ones, so Slack messages end up ranking higher than our other sources.

We considered doing summaries but passing every message through an LLM is expensive
It definitely is expensive.

You could do some sort of retrieval where, when a message is retrieved, you retrieve X messages before and after it as well?
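A rough sketch of that neighbor-expansion idea, assuming the channel's messages are kept in a chronologically ordered list and a vector hit can be mapped back to its position; the function name and window sizes are made up for illustration.

```python
def expand_with_neighbors(hit_index, messages, before=3, after=3):
    """Return the retrieved message plus a few messages of surrounding
    context from the chronologically ordered channel history."""
    start = max(0, hit_index - before)
    end = min(len(messages), hit_index + after + 1)
    return "\n".join(m["text"] for m in messages[start:end])

# Usage: map the id returned by vector search back to the message's
# position in the channel, then pass the expanded window to the LLM.
```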
That's a good idea, will definitely do that!

Are there any techniques to reduce the issue where short passages match more closely to queries than long passages?
reranking maybe?
retrieve a higher top-k -> rerank
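A minimal sketch of that two-stage retrieve-then-rerank flow, assuming a sentence-transformers cross-encoder; `vector_store.search` is a stand-in for whatever search call your stack exposes, and the model name is just a commonly used example, not something prescribed here.

```python
from sentence_transformers import CrossEncoder

def retrieve_then_rerank(query, vector_store, top_k=50, final_k=5):
    """Over-retrieve from the vector store, then rescore candidates with a
    cross-encoder so short chat messages can't win on length bias alone."""
    # vector_store.search is a placeholder; assume it returns text chunks.
    candidates = vector_store.search(query, k=top_k)

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, doc) for doc in candidates])

    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]
```

Because the cross-encoder scores the query and candidate jointly, it is less sensitive to the short-passage similarity inflation than a pure bi-encoder ranking.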