LlamaIndex

Updated 2 months ago

Retrieving Text with BM25 in Korean and Japanese

At a glance

Community members discuss how to use the BM25 retriever with Korean or Japanese text. The original poster hit the error "ValueError: ko not recognized" because the underlying bm25s library ships stopword lists only for English (the default) and a handful of other languages. Suggestions include using the OpenAI tokenizer without stemming, or supplying a custom regex pattern for tokenization. The thread also touches on the difficulty of using BM25 for multilingual search and mentions a related approach called BMX. The workaround offered is to set skip_stemming=True and leave the language as English so that the English stopwords apply, which participants acknowledge is not ideal.

How can I use the BM25 retriever with Korean or Japanese? I received this error:

ValueError: ko not recognized. Only English stopwords as default, German, Dutch, French, Spanish, Portuguese, Italian, Russian, Swedish, Norwegian, and Chinese are currently supported. Please input a list of stopwords

Likely from the stemmer?

29 comments

I just want to use the OpenAI tokenizer without a stemmer. I want to handle multiple languages.

Hmm, I think I can add a skip_stemming bool param to the constructor to help with that

although no guarantee that the bm25s tokenize function will tokenize correctly lol

Stemming is pretty important for Latin-alphabet languages

https://arxiv.org/html/2407.03618v1

Yea, it's using this repo:
https://github.com/xhluca/bm25s

https://www.mixedbread.ai/blog/intro-bmx

Found something interesting

https://github.com/mixedbread-ai/baguetter/blob/main/baguetter/indices/sparse/bmx.py

BM25 has been a pain for hybrid search in non-English languages

Or skip stemming if the selected language isn't supported?

Same with the stopwords

yea, BMX is based on bm25s actually. I was hoping bm25s would add support, but so far they haven't

I published a change, 0.5.1. You can set skip_stemming=True. You can also provide the regex pattern for the tokenizer (defaults to r"(?u)\b\w\w+\b")

0.5.2 now

yea, I had a bug lol

whoops

just set English stopwords imo

not ideal, but eh

In the language param?

yea, leave it as en

Would be nice if bm25s expanded their language support for stopwords. I see zh has been added.

For this one, does the default use the token pattern? And does the tokenizer only get used if I provide a function?

tokenizer is unused, it's a deprecated param

default token pattern is probably fine? I might test in regex101 lol
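As a rough local stand-in for the regex101 check, here is a minimal sketch that applies the default pattern quoted in this thread with Python's standard `re` module. The `regex_tokenize` helper and its lowercasing step are assumptions made for illustration, not the bm25s or LlamaIndex implementation. It suggests the pattern copes with space-separated Korean, but silently drops single-character words.

```python
import re

# Default pattern quoted in the thread: runs of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"


def regex_tokenize(text: str, pattern: str = DEFAULT_TOKEN_PATTERN) -> list[str]:
    """Hypothetical helper: lowercase the text and return every pattern match."""
    return re.findall(pattern, text.lower())


# English: the one-letter word "A" is too short to match and is dropped.
english_tokens = regex_tokenize("A BM25 retriever")

# Korean: words are space-separated, so they split, but the
# single-syllable word "이" is dropped for the same reason as "A".
korean_tokens = regex_tokenize("이 책은 좋다")

print(english_tokens)  # ['bm25', 'retriever']
print(korean_tokens)   # ['책은', '좋다']
```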

In Japanese, there are no spaces between words

This is partly why I think multilingual BM25 works better using just a tokenizer
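To make the no-spaces problem concrete, the sketch below runs the default pattern from this thread over a short Japanese sentence using only Python's `re` module. The per-character pattern is a hypothetical alternative chosen purely for illustration. Nobody in the thread tested it, and a real subword or morphological tokenizer would likely give better results.

```python
import re

# Default pattern quoted in the thread.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

# Hypothetical alternative (an assumption, not from the thread):
# emit every word character as its own token.
PER_CHARACTER_PATTERN = r"(?u)\w"

sentence = "東京で寿司を食べた。"  # "I ate sushi in Tokyo."

# With no spaces to split on, the whole sentence collapses into one token,
# so a query for 寿司 alone would not match it by exact token.
default_tokens = re.findall(DEFAULT_TOKEN_PATTERN, sentence)

# Per-character tokens at least allow partial overlap between query and document.
char_tokens = re.findall(PER_CHARACTER_PATTERN, sentence)

print(default_tokens)  # ['東京で寿司を食べた']
print(char_tokens)
```

A pattern like this would presumably be passed as the tokenizer regex mentioned above, together with skip_stemming=True, but that combination is an untested assumption here.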

idk

anyway thank you for the patch

😄
