The community member is using a multilingual BERT model to extract text from a Czech language Word document, but is having trouble finding specific blood test results like "V_KP = 10, U_RTOP = 15" that are hidden among other parameters. The community member notes that the English BGE model is able to find these specific values, but the multilingual BERT model struggles with them.
In the comments, another community member suggests that the issue may be due to the multilingual BERT model not being trained on medical data, so it doesn't understand the specific terminology. They also note that the BGE model is not multilingual, which is consistent with it performing poorly on most of the Czech-language queries even though it reliably finds the parameter block.
The community member asks if providing more detailed prompts to the model could help, but the other community member indicates that training the language model on the specific domain would likely be required to improve its performance on these types of tasks.
Hello. I really don't understand this and would be grateful for any help. Thank you.
Using the following code:
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(model_name="google-bert/bert-base-multilingual-cased")
data = SimpleDirectoryReader(input_dir="./zprava2/", recursive=True).load_data(show_progress=True)
index = VectorStoreIndex.from_documents(data, show_progress=True)
I ingest a Word document with fewer than 10 pages. It is in the Czech language.
Using model_name="google-bert/bert-base-multilingual-cased", I correctly retrieve many text values, for example the list of diagnoses. It returns fine in 95%+ of cases. But there is also a block of blood test results like V_KP = 10, U_RTOP = 15, ... hidden among many other parameters. This embedding NEVER finds any of those values.
On the contrary, using the English model_name="BAAI/bge-base-en-v1.5", text is found less reliably, in some cases NEVER. But V_KP = 10, U_RTOP = 15, hidden among many other parameters, are now ALWAYS found. Like 100%.
I tried changing the chunk size, to no effect. What is causing this? How can I make the Czech model find these parameters? Why does it not find text that a plain word search does?
Embeddings do not work based purely on word lookup. They work by the model understanding the input text, and mapping it to some vector space.
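As a rough illustration of that vector-space idea (the numbers below are made up for the example, not real model outputs): retrieval ranks chunks by how close their embedding vectors are to the query's, typically via cosine similarity. A model that never learned what strings like "V_KP" mean cannot place the chunk containing them near the query, no matter how exact the word match is.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (placeholders, not real model outputs).
# A model that understands the text maps related strings close together:
query   = [0.9, 0.1, 0.1]   # e.g. "what were the blood test results?"
chunk_a = [0.8, 0.2, 0.1]   # chunk the model understood as being about blood tests
chunk_b = [0.1, 0.9, 0.2]   # chunk about something unrelated

print(cosine(query, chunk_a))  # high similarity -> ranked first, retrieved
print(cosine(query, chunk_b))  # low similarity  -> not retrieved
```

If the model does not understand the chunk's content, its vector effectively lands in the "unrelated" region, which is why exact strings can still fail to be retrieved.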
In this case, my guess is your multilingual embedding model wasn't really trained on medical data, so looking up specific parameters won't work because it doesn't understand them.
For BGE, it's not multilingual, so it makes sense that it's bad at most queries.
Thank you so much for the answer. Guess it's not in my power to train an LLM. Maybe giving it a detailed prompt would make it better? Like: S_zk is blabla, U_KH is this. But still, asking it to find a text should not be such a big problem?
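One way to approximate that idea without training anything is to expand the codes in the document text itself before indexing, so the embedding model sees words it can understand next to each code. A minimal sketch, assuming you can build a glossary of your lab codes (the codes, descriptions, and helper name below are hypothetical placeholders, not real medical data):

```python
# Hypothetical glossary mapping lab codes to plain-language meanings.
# These entries are placeholders; a real glossary would come from your report.
GLOSSARY = {
    "V_KP": "V_KP (blood test parameter)",
    "U_RTOP": "U_RTOP (blood test parameter)",
}

def expand_codes(text: str) -> str:
    """Replace bare lab codes with code + description so the
    embedding model has understandable words to anchor on."""
    for code, meaning in GLOSSARY.items():
        text = text.replace(code, meaning)
    return text

print(expand_codes("Results: V_KP = 10, U_RTOP = 15"))
# -> Results: V_KP (blood test parameter) = 10, U_RTOP (blood test parameter) = 15
```

The idea would be to run each loaded document's text through something like this before building the VectorStoreIndex, so the chunks carry semantic context the multilingual model can actually embed.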