This community is probably the experts on this: what methods are there for doing information retrieval on highly noisy sources like chat channels (Slack, Discord, etc.)? Due to the small amount of text, lack of context, and noise, embedding models don't seem to do well at all with these sources.
I think grouping has a lot to do with it

For example, my company slack uses threads consistently. So I would definitely be grouping by threads.

Otherwise, you can also group by time, and maybe create summaries of time-frames to perform retrieval over πŸ€”
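To make the grouping idea concrete, here is a minimal sketch of bundling Slack-style messages into retrieval units, by thread where one exists and by time window otherwise. The message shape (`ts`, `thread_ts`, `text`) and the 30-minute window are assumptions for illustration, not anything specified in this thread.

```python
from collections import defaultdict
from datetime import timedelta

def group_messages(messages, window=timedelta(minutes=30)):
    """Group chat messages into retrieval units: by thread when a
    thread_ts is present, otherwise into rolling time windows."""
    threads = defaultdict(list)
    loose = []
    for msg in messages:
        if msg.get("thread_ts"):
            threads[msg["thread_ts"]].append(msg)
        else:
            loose.append(msg)

    groups = [sorted(msgs, key=lambda m: m["ts"]) for msgs in threads.values()]

    # Bucket un-threaded messages into time windows anchored at the
    # first message of each bucket.
    loose.sort(key=lambda m: m["ts"])
    bucket = []
    for msg in loose:
        if bucket and msg["ts"] - bucket[0]["ts"] > window.total_seconds():
            groups.append(bucket)
            bucket = []
        bucket.append(msg)
    if bucket:
        groups.append(bucket)

    # Join each group into one document before embedding.
    return ["\n".join(m["text"] for m in group) for group in groups]
```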
Hm, so we do already group by threads, but even then it's too noisy.
Also, most embedding models tend to match queries more closely to short passages than to long ones, so Slack messages end up ranking higher than our other sources.

We considered doing summaries but passing every message through an LLM is expensive
It definitely is expensive.

You could do some sort of retrieval where, when a message is retrieved, you retrieve X messages before and after it as well?
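A rough sketch of that neighbor-expansion idea, assuming the channel's messages are kept in a chronologically ordered list and a vector hit can be mapped back to its position; the function name and window sizes are made up for illustration.

```python
def expand_with_neighbors(hit_index, messages, before=3, after=3):
    """Return the retrieved message plus a few messages of surrounding
    context from the chronologically ordered channel history."""
    start = max(0, hit_index - before)
    end = min(len(messages), hit_index + after + 1)
    return "\n".join(m["text"] for m in messages[start:end])

# Usage: map the id returned by vector search back to the message's
# position in the channel, then pass the expanded window to the LLM.
```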
That's a good idea, will definitely do that!

Are there any techniques to reduce the issue where short passages match more closely to queries than long passages?
reranking maybe?
retrieve a higher top-k -> rerank
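A minimal sketch of that two-stage retrieve-then-rerank flow, assuming a sentence-transformers cross-encoder; `vector_store.search` is a stand-in for whatever search call your stack exposes, and the model name is just a commonly used example, not something prescribed here.

```python
from sentence_transformers import CrossEncoder

def retrieve_then_rerank(query, vector_store, top_k=50, final_k=5):
    """Over-retrieve from the vector store, then rescore candidates with a
    cross-encoder so short chat messages can't win on length bias alone."""
    # vector_store.search is a placeholder; assume it returns text chunks.
    candidates = vector_store.search(query, k=top_k)

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, doc) for doc in candidates])

    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]
```

Because the cross-encoder scores the query and candidate jointly, it is less sensitive to the short-passage similarity inflation than a pure bi-encoder ranking.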