Hello. I really don't understand this and would be grateful for any help. Thank you.
Using the following code:
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Multilingual BERT as the embedding model
Settings.embed_model = HuggingFaceEmbedding(model_name="google-bert/bert-base-multilingual-cased")
# Load everything under ./zprava2/ (the Word document)
data = SimpleDirectoryReader(input_dir="./zprava2/", recursive=True).load_data(show_progress=True)
# Build the vector index over the loaded documents
index = VectorStoreIndex.from_documents(data, show_progress=True)
I am ingesting a Word document of fewer than 10 pages. It is in Czech.
Using model_name="google-bert/bert-base-multilingual-cased", I correctly retrieve many text values, for example the list of diagnoses; those come back fine in 95%+ of cases. But the document also contains a block of blood test results like V_KP = 10, U_RTOP = 15, ... hidden among many other parameters, and this embedding NEVER finds any of those values.
On the contrary, with the English model model_name="BAAI/bge-base-en-v1.5", ordinary text is found less reliably, in some cases NEVER. But V_KP = 10, U_RTOP = 15, hidden among the many other parameters, are now ALWAYS found, practically 100% of the time.
I tried changing the chunk size, to no effect. What is causing this? How can I make the Czech-capable model find these parameters? And why does it fail to find text that a plain word search finds?
Embeddings do not work by pure word lookup. They work by the model understanding the input text and mapping it to some vector space.
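To make that concrete, here is a minimal sketch of what the index actually stores for your lab-results chunk (same embedding class you're already using):

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="google-bert/bert-base-multilingual-cased")

# Every chunk and every query is turned into one fixed-size vector;
# retrieval then ranks chunks by cosine similarity to the query vector,
# not by whether the literal string "V_KP" appears anywhere.
vec = embed_model.get_text_embedding("V_KP = 10, U_RTOP = 15")
print(len(vec))  # 768 dimensions for bert-base models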
In this case, my guess is your multilingual embedding model wasn't really trained on medical data, so queries for specific parameters won't work because the model doesn't understand them.
As for BGE, it's not multilingual, so it makes sense that it's bad at most Czech queries.
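If you want to sanity-check this, a rough diagnostic like the sketch below would work (the distractor sentence is made up; absolute scores aren't comparable across models, only the ranking within one model matters):

import numpy as np
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

QUERY = "V_KP"  # the parameter lookup that fails with the multilingual model
CANDIDATES = [
    "V_KP = 10, U_RTOP = 15",          # the lab-results chunk
    "Pacient ma diagnozu hypertenze",  # made-up Czech distractor sentence
]

def rank(model_name):
    # Score each candidate chunk by cosine similarity to the query
    # under the given embedding model.
    em = HuggingFaceEmbedding(model_name=model_name)
    q = np.array(em.get_query_embedding(QUERY))
    sims = []
    for text in CANDIDATES:
        t = np.array(em.get_text_embedding(text))
        sims.append(float(q @ t / (np.linalg.norm(q) * np.linalg.norm(t))))
    return sims

for name in ("google-bert/bert-base-multilingual-cased", "BAAI/bge-base-en-v1.5"):
    print(name, rank(name))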
Thank you so much for the answer. I guess it's not in my power to train an LLM. Maybe giving it a detailed prompt would make it better? Like: S_zk means this, U_KH means that. But still, asking it to find a piece of text should not be such a big problem?
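For example, would a preprocessing step like this be the right direction? Expanding the abbreviations in the documents before indexing (ABBREV_MAP and its expansions here are just placeholders I made up):

from llama_index.core import Document

# Placeholder glossary; the real meanings would come from the lab's codebook.
ABBREV_MAP = {
    "V_KP": "V_KP (blood test parameter)",
    "U_RTOP": "U_RTOP (urine test parameter)",
}

def expand_abbreviations(doc: Document) -> Document:
    # Rewrite the raw text so each cryptic code carries a plain-language
    # hint the embedding model can actually understand.
    text = doc.text
    for abbrev, expansion in ABBREV_MAP.items():
        text = text.replace(abbrev, expansion)
    return Document(text=text, metadata=doc.metadata)

# Applied to the loaded documents before building the index:
# data = [expand_abbreviations(d) for d in data]
# index = VectorStoreIndex.from_documents(data, show_progress=True)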