Hello. I really don't understand this and would be grateful for any help. Thank you.
Using the following code:
Settings.embed_model = HuggingFaceEmbedding(model_name="google-bert/bert-base-multilingual-cased")
data = SimpleDirectoryReader(input_dir="./zprava2/", recursive=True).load_data(show_progress=True)
index = VectorStoreIndex.from_documents(data, show_progress=True)
I ingest a Word document of fewer than 10 pages. It is written in Czech.
Using model_name="google-bert/bert-base-multilingual-cased", I correctly get many text values, for example the list of diagnoses. These come back correctly more than 95% of the time. But there is also a block of blood test results, such as V_KP = 10, U_RTOP = 15, ..., hidden among many other parameters. This embedding NEVER finds any of these values.
By contrast, using the English model_name="BAAI/bge-base-en-v1.5", the text values are found less reliably, and in some cases NEVER. But V_KP = 10, U_RTOP = 15, hidden among many other parameters, are now ALWAYS found, effectively 100% of the time.
I tried changing the chunk size, to no effect. What is causing this? How can I make the multilingual model find these parameters? Why does the embedding search not find text that a plain word search does?
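For reference, the plain word search I am comparing against can be sketched roughly as below. This is only a minimal sketch: the helper name `extract_parameters` and the regular expression are my own assumptions about how the lab block is formatted (upper-case names, `=`, then a number), and `X_DEMO` is a made-up parameter added to show the decimal comma. It does not explain the embedding behaviour; it only shows that the values are trivially reachable by exact matching.

```python
import re

# Assumed format: an upper-case parameter name, "=", then a number,
# e.g. "V_KP = 10". A decimal comma ("4,2") is accepted as well, which
# assumes that list items are separated by a comma AND a space.
PARAM_PATTERN = re.compile(r"\b([A-Z][A-Z0-9_]*)\s*=\s*(-?\d+(?:[.,]\d+)?)")


def extract_parameters(text: str) -> dict:
    """Return a mapping of parameter name to numeric value found in text."""
    values = {}
    for name, raw in PARAM_PATTERN.findall(text):
        # Normalise a decimal comma to a decimal point before converting.
        values[name] = float(raw.replace(",", "."))
    return values


# V_KP and U_RTOP come from my document; X_DEMO is invented for the example.
sample = "V_KP = 10, U_RTOP = 15, X_DEMO = 4,2"
params = extract_parameters(sample)
print(params)
```

Something like this could run alongside the vector index as a fallback for parameter lookups, but I would still like to understand why the embedding search misses them.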
I am also trying to add some CSV data to a VectorStoreIndex so that I can query it with questions like "What is the CodeName for Code".
Using SimpleDirectoryReader, I gave it a CSV with 100 rows and 2 columns, Code and CodeName. Then I created the index like this:
index = VectorStoreIndex.from_documents
It gave wrong answers for about 50% of the Codes I asked about.
So I gave it only 50 rows, and it answered everything perfectly. What is the limitation?
As I don't know why, I tried splitting the CSV into two files of 50 rows each and loading them with the following code:
data = SimpleDirectoryReader(input_dir="./diagnozy_semicol_noclear_0-50_50-100/").load_data(show_progress=True)
index = VectorStoreIndex.from_documents(data)
It completely forgot the first 50 rows but knew rows 51-100 perfectly. What is happening? How can I teach it more than a few rows?
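One variation I am considering is to turn every CSV row into its own short text, so that each Code/CodeName pair can be embedded and retrieved separately instead of as part of one large chunk. Below is a minimal sketch of that idea. It assumes a semicolon-delimited file with a header row `Code;CodeName` (the semicolon is a guess based on my directory name), the helper name `rows_to_texts` is hypothetical, and the sample rows are made up. I have not verified that this fixes the problem.

```python
import csv
import io


def rows_to_texts(csv_text: str, delimiter: str = ";") -> list:
    """Turn each CSV row into one short, self-describing text."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    texts = []
    for row in reader:
        # Repeat the column names in every text so a row makes sense alone.
        texts.append(f"Code: {row['Code']}; CodeName: {row['CodeName']}")
    return texts


# Made-up sample rows, only to show the shape of the output.
sample_csv = "Code;CodeName\nA01;First example\nB02;Second example\n"
row_texts = rows_to_texts(sample_csv)
for text in row_texts:
    print(text)
```

Each resulting string could then be wrapped in its own document (for example `Document(text=...)` from `llama_index.core`) and passed to `VectorStoreIndex.from_documents`, so that one row corresponds to one retrievable unit.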