Hello everyone,

I recently started working with llama-index and I've encountered a very strange issue that I cannot find a solution to.

I have tried three large embedding models: Salesforce/SFR-Embedding-Mistral, GritLM/GritLM-7B, and intfloat/e5-mistral-7b-instruct. Confusingly, the retrieved nodes (top_k=10) all have scores higher than 0.9999999999. However, when I switched to a smaller embedding model like UAE-Large-V1, the highest score is around 0.65, which seems reasonable.

I also tried modifying my prompt. The results remain the same (the retrieved nodes may differ, but their scores are still extremely close to 1), even when my prompt is 'hello', which has nothing to do with the retrieved node text or the content I feed into the model.

I'm confused about where the problem lies. Below is my code snippet:

Plain Text
import tiktoken
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.azure_openai import AzureOpenAI

llm = AzureOpenAI(
    model="gpt-35-turbo",
    deployment_name='xxxx',
    api_key="xxxxx",
)
embed_model = HuggingFaceEmbedding(
    model_name="Salesforce/SFR-Embedding-Mistral",
    cache_folder='model_cache',
    device='cuda',
    embed_batch_size=1,
    max_length=3072,
)
Settings.llm = llm
Settings.tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode
Settings.embed_model = embed_model

# parse the report files into nodes
documents = SimpleDirectoryReader("../2024_report").load_data()
pipeline = IngestionPipeline(
    transformations=[MarkdownNodeParser(include_metadata=True, include_prev_next_rel=True)],
)
nodes = pipeline.run(documents=documents)
index = VectorStoreIndex(nodes, show_progress=True)

# retrieve the top-10 nodes and print their similarity scores
query = 'hello'
retriever = VectorIndexRetriever(index=index, similarity_top_k=10)
ret_nodes = retriever.retrieve(query)
for ret_node in ret_nodes:
    print(ret_node.score)

I'm reaching out to see if anyone has experienced similar issues. Any insights or suggestions on how to solve this problem would be greatly appreciated. Thank you.
11 comments
I think embedding models based on LLMs require a different pooling mode

Plain Text
embed_model=HuggingFaceEmbedding(
  model_name="Salesforce/SFR-Embedding-Mistral",
  cache_folder='model_cache',
  device='cuda',
  embed_batch_size=1,
  max_length=3072,
  pooling="last",
)
(This is my unwarranted opinion, but also, these large models generally aren't worth the increase in compute power needed to run them, compared to smaller models)
Got it, thanks. What does pooling='last' mean? Can you share a link to documentation about that? And what other pooling modes can we choose?
mean and cls are the other options
Usually it picks it automatically, but I think last is newer/not automatically picked
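Roughly speaking, the pooling mode controls how the per-token hidden states are collapsed into a single embedding vector. Here is a conceptual sketch of the three options (not LlamaIndex's exact internals; real last-token pooling also skips padding tokens):
Plain Text
import torch

# token_states: [num_tokens, hidden_dim] hidden states for one input text
token_states = torch.randn(7, 4096)

cls_vec = token_states[0]            # "cls":  first token's state
mean_vec = token_states.mean(dim=0)  # "mean": average over all tokens
last_vec = token_states[-1]          # "last": final token's state, which decoder-only
                                     # models like SFR-Embedding-Mistral and
                                     # e5-mistral are trained to use
Decoder-only models only attend left-to-right, so the last token is the only position that has seen the whole input; mean or cls pooling on such a model can produce near-identical vectors for very different texts, which matches the ~1.0 scores you're seeing.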
Okay, I will try it and see what happens. Thanks for your suggestion. 😆
Hi Logan, sorry to bother you again.
After I added
Plain Text
pooling='last'
when loading the embed_model, the scores are no longer 0.999999, which is great. However, I found these big models are still not satisfactory.
Plain Text
# additional imports on top of the earlier snippet
import chromadb
from llama_index.core import StorageContext
from llama_index.core.schema import TransformComponent
from llama_index.vector_stores.chroma import ChromaVectorStore

embed_model = HuggingFaceEmbedding(
    model_name="intfloat/e5-mistral-7b-instruct",
    cache_folder='folder',
    embed_batch_size=1,
    max_length=3072,
    device='cuda',
    pooling='last',
)
Settings.llm = llm
Settings.tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo").encode
Settings.embed_model = embed_model

class AddDate(TransformComponent):
    def __call__(self, nodes, **kwargs):  # signature assumed; body elided (xxx) in the original post
        xxx
        return nodes

documents = SimpleDirectoryReader("../2024_report").load_data()
pipeline = IngestionPipeline(
    transformations=[
        MarkdownNodeParser(include_metadata=True, include_prev_next_rel=True),
        AddDate(),
    ],
)
nodes = pipeline.run(documents=documents)

# persist the embeddings in Chroma instead of the default in-memory store
db = chromadb.PersistentClient(path="../chroma_db/e5_mistral_7b_instruct")
chroma_collection = db.get_or_create_collection("platts_2024")
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context, show_progress=True)
Plain Text
from llama_index.core.vector_stores import MetadataInfo, VectorStoreInfo

vector_store_info = VectorStoreInfo(
    content_info="reference date information of the passage",
    metadata_info=[
        MetadataInfo(
            name="ReferenceYear",
            type="int",
            description="the year of the reference date of the passage",
        ),
        MetadataInfo(
            name="ReferenceMonth",
            type="int",
            description="the month of the reference date of the passage",
        ),
        MetadataInfo(
            name="ReferenceDay",
            type="int",
            description="the day of the reference date of the passage",
        ),
    ],
)
from llama_index.core import get_response_synthesizer
from llama_index.core.postprocessor.types import BaseNodePostprocessor
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexAutoRetriever

class DeleteShortNode(BaseNodePostprocessor):
    def _postprocess_nodes(self, nodes, query_bundle=None):  # method name assumed; body elided (xxx) in the original post
        xxx
        return new_nodes

retriever = VectorIndexAutoRetriever(index, vector_store_info=vector_store_info, similarity_top_k=20)
response_synthesizer = get_response_synthesizer()
query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=[DeleteShortNode()],
    response_synthesizer=response_synthesizer,
)
query = 'What happened to the iron ore markets on 2024-02-13? Use about 500 words to illustrate.'
ret_nodes = query_engine.retrieve(query)
The highest score in ret_nodes (i.e. ret_nodes[0].score) is 0.31, whereas if I directly calculate the cosine similarity using the embed_model:
Plain Text
embed_model.similarity(
    embed_model.get_text_embedding(query),
    embed_model.get_text_embedding(ret_nodes[0].text),
)
it turns out to be 0.45.
So is the score difference due to the metadata and other data contained in the nodes?
Also, I don't know why the score is that low, and the retrieved nodes really don't have much to do with my queries.
Is there any way, or other parameters, to tune the embedding models so that the scores are more meaningful? Thank you for your patience in reading such a long question.
I would appreciate it very much if you could give some suggestions. ☺️
I wouldn't obsess too much about the score. It doesn't really mean much, as long as the correct stuff is retrieved.

And yes, the metadata+text is embedded: node.get_content(metadata_mode="embed")

Not sure what else to say though. Maybe try the mean embedding mode?
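As a small follow-up sketch (assuming the variables from the snippets above, and assuming AddDate stores ReferenceYear/ReferenceMonth/ReferenceDay in the node metadata), this shows how to see exactly what gets embedded for a node and, if needed, keep the date metadata out of the embedded text:
Plain Text
from llama_index.core.schema import MetadataMode

# The index embeds metadata + text, not node.text alone,
# which is where the 0.31 vs 0.45 gap comes from.
embed_text = ret_nodes[0].node.get_content(metadata_mode=MetadataMode.EMBED)
print(embed_model.similarity(
    embed_model.get_text_embedding(query),
    embed_model.get_text_embedding(embed_text),
))

# If the date metadata drags the similarity down, it can be excluded from the
# embedded text before building the index:
for node in nodes:
    node.excluded_embed_metadata_keys = ["ReferenceYear", "ReferenceMonth", "ReferenceDay"]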