Embedding setup

I want to use sentence-transformers for embeddings but this seems to make llama-index force me to install llama-cpp and download the llama-2-13b which I don't need or want. Is there any way to avoid this?
That shouldn't be the case!
If possible, can you share the code you are trying this with?
@WhiteFang_Jr sure. This is just a slightly modified version of the first section from https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/OpensearchDemo.html where I'm trying to slot in the sentence-transformers:

Plain Text
#!/usr/bin/env python3
from os import getenv
from llama_index import SimpleDirectoryReader
from llama_index.vector_stores import OpensearchVectorStore, OpensearchVectorClient
from llama_index import VectorStoreIndex, StorageContext
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import ServiceContext, set_global_service_context

from llama_index import Document
from llama_index.vector_stores.types import MetadataFilters, ExactMatchFilter

###Setup the embedding model
embed_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

set_global_service_context(service_context)


endpoint = getenv("OPENSEARCH_ENDPOINT", "http://localhost:9200")
# index to demonstrate the VectorStore impl
idx = getenv("OPENSEARCH_INDEX", "gpt-index-demo")
# load some sample data
documents = SimpleDirectoryReader("../paul_graham_essay/data").load_data()

text_field = "content"

embedding_field = "embedding"

client = OpensearchVectorClient(
    endpoint, idx, 1536, embedding_field=embedding_field, text_field=text_field
)

vector_store = OpensearchVectorStore(client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
    service_context=service_context
)

I've confirmed that this works if I set the OpenAI api key, but if I omit it, it fires the llamacpp error. I'm only trying to index at this point so I don't think there should be any querying going on. This is a fresh installation of llamaindex from github. What am I doing wrong?
Oh, I think it's because the default model has been changed.

Let me just double-check this. Checking the changelog.
@WhiteFang_Jr thanks! It's strange, because in this code I don't think I'm actually trying to use any LLM at all.
https://discord.com/channels/1059199217496772688/1059200134518427678/1141810512301142157


Yeah, if you check this: when the OpenAI validation fails, it falls back to the local default model, which is the 13B-parameter one you mentioned.


Sure, if you don't want to use an LLM, just pass llm=None in the service_context. I think that will resolve the issue for you.
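
For reference, a minimal sketch of what that could look like with the code above (assuming the same HuggingFaceEmbeddings embed_model; llm=None makes LlamaIndex use a MockLLM instead of trying OpenAI or llama-cpp):

Plain Text
# No LLM is needed for indexing; disable it so no OpenAI key or llama-cpp
# download is required. Embeddings still come from sentence-transformers.
service_context = ServiceContext.from_defaults(
    llm=None,
    embed_model=embed_model,
)
set_global_service_context(service_context)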
In the end I want to use OpenAI as the LLM here, but I also wanted to be sure that I'm actually using sentence-transformers for the embedding generation, and since the above code isn't actually calling an LLM, I didn't feel confident about what was happening with the embeddings. I'll try what you suggest.
Sure. If you pass the OpenAI key in the environment or set the openai.api_key value, it will use OpenAI as the default LLM.
Right, but I guess what I'm trying to say is: if I provide the sentence-transformers embedding config and the OpenAI key, "everything works", but I wasn't sure whether OpenAI was still being used for the embeddings as well. More of a comprehension issue on my end, I guess.
Yeah, if you pass only the OpenAI key, both the LLM and the embedding model will come from OpenAI.

But if you pass your own embed_model, it will use OpenAI only for the LLM part and your embed_model for the embeddings.

It will not use the OpenAI embeddings in that case.
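
In other words, something along these lines should give you OpenAI for the LLM and sentence-transformers for the embeddings (a sketch; gpt-3.5-turbo is just an example model name):

Plain Text
from llama_index.llms import OpenAI

# OpenAI is used only for LLM calls; embeddings come from the local
# sentence-transformers model passed as embed_model.
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo"),
    embed_model=embed_model,
)
set_global_service_context(service_context)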
@WhiteFang_Jr cool, it worked! I also had to change the embedding dimension specified in the client from 1536 to 768, which I guess is just a consequence of the different embedding model? I guess this may also impact search quality, since we're looking at roughly half the resolution, but the biggest hurdle is solved. I wonder if this embedding dimension is specified somewhere with each model (it probably is). Thank you so much!

Plain Text
client = OpensearchVectorClient(
    endpoint, idx, 768, embedding_field=embedding_field, text_field=text_field
)

vector_store = OpensearchVectorStore(client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# initialize an index using our sample data and the client we just created
index = VectorStoreIndex.from_documents(
    documents=documents,
    storage_context=storage_context,
    service_context=service_context
)
Plain Text
 paul_graham_essay % python3 opensearch_eg.py
LLM is explicitly disabled. Using MockLLM.
Yep, different embedding models have different dimensions. That will be up to the model you choose to go with.

Not sure if it is mentioned in the LlamaIndex docs, but there are sites that keep track of all the open-source embedding models.
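
If it helps, the sentence-transformers library itself can report the dimension for a given model, so you don't have to guess the value to pass to OpensearchVectorClient (a small sketch, assuming sentence-transformers is installed):

Plain Text
from sentence_transformers import SentenceTransformer

# all-mpnet-base-v2 produces 768-dimensional embeddings
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
print(model.get_sentence_embedding_dimension())  # -> 768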
Just in case someone else happens upon this: at present the OpenSearch documentation example (https://gpt-index.readthedocs.io/en/latest/examples/vector_stores/OpensearchDemo.html) doesn't include an example of how to load an existing index. You cannot use the default load_index_from_storage either; instead you need to do the following:

Plain Text
client = OpensearchVectorClient(
    endpoint, idx, 768, embedding_field=embedding_field, text_field=text_field
)

vector_store = OpensearchVectorStore(client)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)

and use the VectorStoreIndex.from_vector_store method to load it.
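
For completeness, a rough sketch of querying the reloaded index afterwards (assuming a service_context with a real LLM, e.g. OpenAI, has been set globally):

Plain Text
# Query against the existing OpenSearch-backed index
query_engine = index.as_query_engine()
response = query_engine.query("What did the author work on?")
print(response)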