OsirisMP
Joined September 25, 2024
Hello. I really don't understand this and would be grateful for any help. Thank you.

Using the following code:

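from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

Settings.embed_model = HuggingFaceEmbedding(model_name="google-bert/bert-base-multilingual-cased")
data = SimpleDirectoryReader(input_dir="./zprava2/", recursive=True).load_data(show_progress=True)
index = VectorStoreIndex.from_documents(data, show_progress=True)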

I ingest a Word document of fewer than 10 pages. It is written in Czech.

Using model_name="google-bert/bert-base-multilingual-cased":
Many text values, for example the list of diagnoses, are retrieved correctly, in more than 95% of cases.
But there is also a block of blood test results such as V_KP = 10, U_RTOP = 15, ... hidden among many other parameters. This embedding NEVER finds any of these values.

By contrast, using the English model_name="BAAI/bge-base-en-v1.5":
Text is found less reliably, and in some cases NEVER.
But V_KP = 10, U_RTOP = 15, hidden among many other parameters, are now ALWAYS found, in practically 100% of cases.

I tried changing the chunk size, to no effect. What is causing this? How can I make the model used for Czech find these parameters? Why does it not find that text when a plain word search does?
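For comparison, the plain word search mentioned above can be sketched as a regular-expression lookup that pulls NAME = value pairs straight out of the document text, without any embedding. This is only a rough sketch: the pattern for parameter names is an assumption based on the two examples V_KP and U_RTOP, and the function name extract_parameters is made up for illustration.

```python
import re

# Assumed pattern: an uppercase name such as V_KP or U_RTOP, then "=", then a number.
# A decimal comma or point is accepted; adjust this to the real report format.
PARAM_RE = re.compile(r"\b([A-Z][A-Z0-9_]*)\s*=\s*(\d+(?:[.,]\d+)?)")


def extract_parameters(text: str) -> dict[str, float]:
    """Return {parameter name: numeric value} for every NAME = number pair found."""
    values = {}
    for name, raw in PARAM_RE.findall(text):
        # Normalize a decimal comma to a point before converting.
        values[name] = float(raw.replace(",", "."))
    return values


sample = "V_KP = 10, U_RTOP = 15"
print(extract_parameters(sample))
```

A lookup like this matches the exact characters, which is presumably why word search finds the values even when similarity search over embeddings does not.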
So I would be happy if someone could help.

I am trying to add some CSV data to a VectorStoreIndex and query it with questions like "What is the CodeName for Code".

Using SimpleDirectoryReader, I gave it a CSV with 100 rows and 2 columns, Code and CodeName. Then I created the index like this:
index = VectorStoreIndex.from_documents
It gave wrong answers for 50% of the given Codes.

So I gave it only 50 rows, and it answered everything perfectly. What is the limitation?


As I don't know why, I tried splitting the CSV into 2 files of 50 rows each and loading them with the following code:

data = SimpleDirectoryReader(input_dir="./diagnozy_semicol_noclear_0-50_50-100/").load_data(show_progress=True)
index = VectorStoreIndex.from_documents(data)

It completely forgot the first 50 rows but knew rows 51-100 perfectly. What is happening? How can I teach it more than a few rows?
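One way to check whether the missing rows come from how the CSV is chunked would be to turn every row into its own short text, so that each Code and CodeName pair can be retrieved on its own. The sketch below is only an assumption about what might help, not a confirmed fix. The column names Code and CodeName are from the description above, the semicolon delimiter is guessed from the directory name, and the two sample rows and the function name rows_to_snippets are invented placeholders.

```python
import csv
import io


def rows_to_snippets(csv_text: str, delimiter: str = ";") -> list[str]:
    """Turn each CSV row into one self-contained sentence."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return [f"Code {row['Code']} has CodeName {row['CodeName']}." for row in reader]


# Placeholder rows, invented for illustration only (not real data).
sample = "Code;CodeName\nA00;Example name one\nA01;Example name two\n"
snippets = rows_to_snippets(sample)
for snippet in snippets:
    print(snippet)
```

Each snippet could then be wrapped in a llama_index Document(text=snippet) and passed to VectorStoreIndex.from_documents in place of the output of SimpleDirectoryReader. Whether that removes the row limit here is untested.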

Thank you so much, I am completely lost.