
Updated 2 weeks ago

Retrieving Text with BM25 in Korean and Japanese

How can I use the BM25 retriever with Korean or Japanese? I received this error: ValueError: ko not recognized. Only English stopwords as default, German, Dutch, French, Spanish, Portuguese, Italian, Russian, Swedish, Norwegian, and Chinese are currently supported. Please input a list of stopwords

likely from the stemmer?
29 comments
I just want to use the OpenAI tokenizer without a stemmer, I want to handle multiple languages
Hmm, I think I can add a skip_stemming bool param to the constructor to help with that
although no guarantee that the bm25s tokenize function will tokenize correctly lol

Stemming is pretty important for Latin-alphabet languages
Found something interesting
BM25 has been a pain for hybrid search in non-English languages
Or skip stemming if the selected language isn't supported?
Same with the stopwords
yea bmx is based on bm25s actually. I was hoping bm25s would add support, but so far they haven't
I published a change, 0.5.1. You can set skip_stemming=True, you can also provide the regex pattern for the tokenizer (defaults to r"(?u)\b\w\w+\b")
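
For anyone curious what that default pattern actually matches, here is a minimal sketch using only Python's standard `re` module. It does not call bm25s or the retriever; `regex_tokenize` is a hypothetical helper, and lowercasing the input is an assumption, not something confirmed in this thread.

```python
import re

# Default token pattern quoted in the message above.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def regex_tokenize(text: str, pattern: str = TOKEN_PATTERN) -> list[str]:
    """Hypothetical helper: lowercase the text and return every pattern match."""
    return re.findall(pattern, text.lower())

# The pattern requires at least two word characters, so "I" and "a" are dropped.
english_tokens = regex_tokenize("I am a cat")

# Korean separates words with spaces, so the pattern splits on them.
korean_tokens = regex_tokenize("나는 서울에 산다")

print(english_tokens)
print(korean_tokens)
```

So for space-delimited languages the pattern gives usable word-level tokens, although single-character words are silently discarded.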
yea I had a bug lol
just set english stopwords imo
not ideal, but eh
yea, leave it as en
would be nice if bm25s expanded their language support for stopwords. I see zh was added.
For this one, the default uses the token pattern? And the tokenizer only gets used if I provide a function?
tokenizer is unused, it's a deprecated param
default token pattern is probably fine? I might test in regex101 lol
in Japanese, there are no spaces between words
this is partly why I think multilingual BM25 works better using just a tokenizer
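
To illustrate the point about spaces: with the default pattern, a whole Japanese sentence comes back as a single token, which makes BM25 term matching close to useless. The sketch below shows that, plus one possible workaround using overlapping character bigrams. This is a swapped-in technique for illustration, not the OpenAI tokenizer mentioned earlier and not something bm25s or the retriever is confirmed to do; `char_ngrams` is a hypothetical helper.

```python
import re

DEFAULT_PATTERN = r"(?u)\b\w\w+\b"

def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Hypothetical helper: split text into overlapping character n-grams."""
    tokens: list[str] = []
    # Handle each run of word characters separately so that
    # n-grams never span punctuation or whitespace.
    for run in re.findall(r"\w+", text):
        if len(run) < n:
            tokens.append(run)  # keep short runs whole
        else:
            tokens.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return tokens

sentence = "私は東京に住んでいます。"

# No spaces, so the default pattern sees one long "word".
default_tokens = re.findall(DEFAULT_PATTERN, sentence)

# Bigrams give BM25 smaller units that can overlap between query and document.
bigram_tokens = char_ngrams("東京に住む。")

print(default_tokens)
print(bigram_tokens)
```

Bigrams are crude compared with a real morphological analyzer or a subword tokenizer, but they need no language-specific resources, which fits the multilingual use case here.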
anyway thank you for the patch