Hello everyone,

I recently started working with llama-index and I've encountered a very weird issue that I cannot find a solution to.

I have tried three big embedding models: Salesforce/SFR-Embedding-Mistral, GritLM/GritLM-7B, and intfloat/e5-mistral-7b-instruct. It's very confusing that the retrieved nodes (top_k=10) all have scores higher than 0.9999999999. However, when I switched to a smaller embedding model like UAE-Large-V1, the highest score is around 0.65, which seems to be okay.

Plus, I tried modifying my prompt. The results remain the same (the retrieved nodes may be different, but their scores are still very close to 1), even when my prompt is 'hello', which has nothing to do with the retrieved node text or the contents I feed into the model.

I'm confused about where the problem lies. Below is my code snippet:

Plain Text
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.azure_openai import AzureOpenAI
import tiktoken

llm = AzureOpenAI(
    model="gpt-35-turbo",
    deployment_name='xxxx',
    api_key="xxxxx",
)
embed_model = HuggingFaceEmbedding(
    model_name="Salesforce/SFR-Embedding-Mistral",
    cache_folder='model_cache',
    device='cuda',
    embed_batch_size=1,
    max_length=3072,
)
Settings.llm = llm
Settings.tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode
Settings.embed_model = embed_model

# Load and parse the markdown reports into nodes
documents = SimpleDirectoryReader("../2024_report").load_data()
pipeline = IngestionPipeline(
    transformations=[MarkdownNodeParser(include_metadata=True, include_prev_next_rel=True)],
)
nodes = pipeline.run(documents=documents)
index = VectorStoreIndex(nodes, show_progress=True)

# Retrieve the top 10 nodes for an unrelated query and print their scores
query = 'hello'
retriever = VectorIndexRetriever(index=index, similarity_top_k=10)
ret_nodes = retriever.retrieve(query)
for ret_node in ret_nodes:
    print(ret_node.score)

I'm reaching out to see if anyone has experienced similar issues. Any insights or suggestions on how to solve this problem would be greatly appreciated. Thank you.
11 comments
I think embedding models based on LLMs require a different pooling mode

Plain Text
embed_model=HuggingFaceEmbedding(
  model_name="Salesforce/SFR-Embedding-Mistral",
  cache_folder='model_cache',
  device='cuda',
  embed_batch_size=1,
  max_length=3072,
  pooling="last",
)
(This is my unwarranted opinion, but also, these large models generally aren't worth the increase in compute power needed to run them, compared to smaller models)
Got it, thanks. What does pooling='last' mean? Can you share a link to the documentation about that? And what other pooling modes can we choose?
mean and cls are the other options
Usually it picks it automatically, but I think last is newer/not automatically picked
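If it helps, here's a rough sketch of how the three pooling modes could be compared side by side. This assumes the same HuggingFaceEmbedding constructor used above; the sample text is just a placeholder, not anything from the original data.

Plain Text
# Sketch: compare query/text similarity under each pooling mode
for pooling in ("cls", "mean", "last"):
    em = HuggingFaceEmbedding(
        model_name="Salesforce/SFR-Embedding-Mistral",
        device="cuda",
        max_length=3072,
        pooling=pooling,
    )
    q = em.get_query_embedding("hello")
    t = em.get_text_embedding("Iron ore prices rose on strong Chinese demand.")
    # An unrelated query/text pair should score well below 1.0 with a sensible pooling mode
    print(pooling, em.similarity(q, t))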
Okay, will try and see what happens. Thanks for your suggestion. 😆
Hi Logan, sorry to bother you again.
After I added
Plain Text
pooling='last'
when loading the embed_model, the scores are no longer 0.999999, which is great. However, I still find these big models unsatisfactory.
Plain Text
import chromadb
from llama_index.core import StorageContext
from llama_index.core.schema import TransformComponent
from llama_index.vector_stores.chroma import ChromaVectorStore

embed_model = HuggingFaceEmbedding(
    model_name="intfloat/e5-mistral-7b-instruct",
    cache_folder='folder',
    embed_batch_size=1,
    max_length=3072,
    device='cuda',
    pooling='last',
)
Settings.llm = llm
Settings.tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode
Settings.embed_model = embed_model

# Custom transform that adds the reference date metadata (body elided in the original post)
class AddDate(TransformComponent):
    xxx
    return nodes

documents = SimpleDirectoryReader("../2024_report").load_data()
pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser(include_metadata=True, include_prev_next_rel=True),
        AddDate(),
    ],
)
nodes = pipeline.run(documents=documents)

# Persist the embeddings in a local Chroma collection
db = chromadb.PersistentClient(path="../chroma_db/e5_mistral_7b_instruct")
chroma_collection = db.get_or_create_collection("platts_2024")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)
Plain Text
from llama_index.core import get_response_synthesizer
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexAutoRetriever
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo

# Describe the date metadata so the auto-retriever can build filters from the query
vector_store_info = VectorStoreInfo(
    content_info="reference date information of the passage",
    metadata_info=[
        MetadataInfo(
            name="ReferenceYear",
            type="int",
            description="the year of the reference date of the passage",
        ),
        MetadataInfo(
            name="ReferenceMonth",
            type="int",
            description="the month of the reference date of the passage",
        ),
        MetadataInfo(
            name="ReferenceDay",
            type="int",
            description="the day of the reference date of the passage",
        ),
    ],
)

# Postprocessor that drops very short nodes (body elided in the original post)
class DeleteShortNode(BaseNodePostprocessor):
    xxx
    return new_nodes

retriever = VectorIndexAutoRetriever(index, vector_store_info=vector_store_info, similarity_top_k=20)
response_synthesizer = get_response_synthesizer()
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[DeleteShortNode()],
    response_synthesizer=response_synthesizer,
)
query = 'What happened to the iron ore markets on 2024-02-13? Use about 500 words to illustrate.'
ret_nodes = query_engine.retrieve(query)
The highest score in ret_nodes (which is ret_nodes[0].score) is 0.31, while if I directly calculate the cosine similarity using the embed_model:
Plain Text
embed_model.similarity(embed_model.get_text_embedding(query),embed_model.get_text_embedding(ret_nodes[0].text))
it turns out to be 0.45.
Is the score difference because of the metadata and other data contained in the nodes?
Plus, I don't know why the score is that low, and the retrieved nodes indeed don't have much to do with my queries.
Is there any way, or some other parameters to tune in the embedding model, to get more meaningful scores? Thank you for your patience in reading such a long question.
I would appreciate it so much if you could give some suggestions. ☺️
I wouldn't obsess too much about the score. It doesn't really mean much, as long as the correct stuff is retrieved.

And yes, the metadata+text is embedded: node.get_content(metadata_mode="embed")

Not sure what else to say though. Maybe try the mean embedding mode?
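To make that concrete, here is a rough sketch building on the snippets above. MetadataMode and excluded_embed_metadata_keys are standard llama-index options, not something specific to this thread, and the key names are taken from the VectorStoreInfo shown earlier.

Plain Text
from llama_index.core.schema import MetadataMode

# Inspect exactly what text was embedded for the top node (metadata + text)
node = ret_nodes[0].node
embedded_text = node.get_content(metadata_mode=MetadataMode.EMBED)
print(embedded_text)

# Retrieval uses the query-embedding path, so compare against that
q_emb = embed_model.get_query_embedding(query)
print(embed_model.similarity(q_emb, embed_model.get_text_embedding(embedded_text)))  # metadata + text
print(embed_model.similarity(q_emb, embed_model.get_text_embedding(node.text)))      # raw text only

# To keep the date metadata for filtering but leave it out of the embeddings,
# exclude the keys on each node before building the index:
for n in nodes:
    n.excluded_embed_metadata_keys = ["ReferenceYear", "ReferenceMonth", "ReferenceDay"]

Whether the date metadata should be embedded at all is a separate judgment call; excluding it keeps the embedding focused on the passage text, while the auto-retriever can still filter on the stored metadata.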