
Updated 2 months ago

Retrieving Text with BM25 in Korean and Japanese

At a glance

Community members discuss how to use the BM25 retriever with Korean or Japanese text. The original poster hit the error "ValueError: ko not recognized" because the BM25 library ships stopword lists only for English (the default) and a few other languages, and neither Korean nor Japanese is among them. Suggested approaches include using the OpenAI tokenizer without stemming or supplying a custom regex pattern for tokenization. The thread also covers the difficulty of using BM25 for multilingual search and mentions a related library, BMX. The workaround that emerges is to set skip_stemming=True and leave the stopwords set to English, which participants acknowledge is not ideal.

How can I use the BM25 retriever with Korean or Japanese? I received this error: "ValueError: ko not recognized. Only English stopwords as default, German, Dutch, French, Spanish, Portuguese, Italian, Russian, Swedish, Norwegian, and Chinese are currently supported. Please input a list of stopwords"

likely from the stemmer ?
i want to just use openai tokenizer without stemmer, i want to handle multi languages
Hmm, I think I can add a skip_stemming bool param to the constructor to help with that
although no guarantee that the bm25s tokenize function will tokenize correctly lol

Stemming is pretty important for latin-alphabet languages
Found something interesting
Bm25 has been a pain for hybrid search for non english language
Or skip stemming if the selected language isn't supported ?
Same with the stopwords
yea bmx is based on bm25s actually. I was hoping bm25s would add support, but so far, they haven't
I published a change, 0.5.1. You can set skip_stemming=True, you can also provide the regex pattern for the tokenizer (defaults to r"(?u)\b\w\w+\b")
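To get a feel for what that default pattern does, here is a minimal sketch using only Python's re module. It does not call the retriever; the tokenize helper and the sample strings are made up for illustration, and the lowercasing step is an assumption.

```python
import re

# Default pattern quoted above: runs of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text: str, pattern: str = DEFAULT_PATTERN) -> list[str]:
    """Hypothetical helper: lowercase the text and return every regex match."""
    return re.findall(pattern, text.lower())

# English: the one-letter word "a" is dropped because the pattern needs 2+ characters.
english = tokenize("BM25 is a ranking function")

# Korean: words are separated by spaces, so the pattern splits on them.
korean = tokenize("나는 서울에 산다")

print(english)
print(korean)
```

Because the pattern requires at least two word characters, any single-character word is silently discarded, which may matter for languages where short words carry meaning.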
yea i had bug lol
just set english stopwords imo
not ideal, but eh
yea, leave it as en
would be nice if bm25s expanded their language support for stopwords. I see zh is added.
for this one, is the default to use the token pattern? and does tokenizer only get used if i provide a function?
tokenizer is unused, it's a deprecated param
default token pattern is probably fine? I might test in regex101 lol
in japanese, there's no space
this is partly why I think bm25 in multilingual works better using just tokenizer
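The no-spaces problem can be seen with plain re, again without touching the retriever. In this rough sketch the sample sentence and the per-character pattern are illustrative assumptions, not something proposed in the thread; a per-character pattern is a crude stand-in for a real Japanese tokenizer.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"  # default pattern quoted earlier in the thread
PER_CHAR_PATTERN = r"\w"            # hypothetical custom pattern: one token per character

sentence = "私は東京に住んでいます。"  # "I live in Tokyo."

# With no spaces, every character is a word character, so the whole
# sentence (minus the final punctuation mark) comes back as one token.
default_tokens = re.findall(DEFAULT_PATTERN, sentence)

# A per-character pattern at least yields terms that can overlap between
# a query and a document, at the cost of losing word boundaries.
char_tokens = re.findall(PER_CHAR_PATTERN, sentence)

print(default_tokens)
print(char_tokens)
```

With the default pattern, a query would only match a document containing the exact same unbroken run of characters, which is why a custom pattern or a dedicated tokenizer matters for Japanese.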
anyway thank you for the patch