from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(embed_batch_size=50)
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
You need to specify the model_name in both places: when running the docker command and when instantiating TextEmbeddingsInference:
docker run --gpus all -p 8080:80 -v /opt/tei:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id jinaai/jina-embeddings-v2-base-en
Settings.embed_model = TextEmbeddingsInference(model_name='jinaai/jina-embeddings-v2-base-en')
When you run a model locally, it is compute bound, which means it cannot handle concurrent requests. In fact, it cannot even use multiprocessing without creating a copy of the model in each process.
I suggest using an embedding server like text-embeddings-inference:
https://github.com/huggingface/text-embeddings-inference
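Once the server is up (for example via the docker command above), the client can keep several embedding requests in flight instead of being blocked by a single in-process model. A minimal sketch of that, assuming the server listens on port 8080 and that the base_url parameter and the async batch helper aget_text_embedding_batch behave as in current LlamaIndex releases:

import asyncio

from llama_index.embeddings.text_embeddings_inference import (
    TextEmbeddingsInference,
)

# Point the client at the running TEI container (port 8080 per the docker command above).
embed_model = TextEmbeddingsInference(
    model_name="jinaai/jina-embeddings-v2-base-en",
    base_url="http://127.0.0.1:8080",
    embed_batch_size=32,
)


async def embed_all(texts):
    # The server queues and batches requests itself, so multiple batches can
    # be processed concurrently instead of waiting on one local model.
    batches = [texts[i : i + 32] for i in range(0, len(texts), 32)]
    results = await asyncio.gather(
        *(embed_model.aget_text_embedding_batch(batch) for batch in batches)
    )
    return [vec for batch in results for vec in batch]


embeddings = asyncio.run(embed_all(["first document", "second document"]))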
This will run sparse vector generation locally using the "naver/efficient-splade-VI-BT-large-doc" model from Huggingface, in addition to generating dense vectors with OpenAI.
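The sparse_doc_vectors helper further below assumes a doc_tokenizer and doc_model are already loaded. A minimal sketch of that setup with Hugging Face transformers, assuming the masked-LM head is the right one for this checkpoint (SPLADE weights are derived from the MLM logits); the variable names match what the helper expects:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

doc_tokenizer = AutoTokenizer.from_pretrained(
    "naver/efficient-splade-VI-BT-large-doc"
)
doc_model = AutoModelForMaskedLM.from_pretrained(
    "naver/efficient-splade-VI-BT-large-doc"
)

# Move the model to GPU if one is available, matching the helper's tokens.to("cuda").
if torch.cuda.is_available():
    doc_model = doc_model.to("cuda")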
from llama_index.embeddings.text_embeddings_inference import (
    TextEmbeddingsInference,
)
def sparse_doc_vectors(
    texts: List[str],
) -> Tuple[List[List[int]], List[List[float]]]:
    """
    Computes vectors from logits and attention mask using ReLU, log, and max operations.
    """
    tokens = doc_tokenizer(
        texts, truncation=True, padding=True, return_tensors="pt"
    )
    if torch.cuda.is_available():
        tokens = tokens.to("cuda")

    output = doc_model(**tokens)
    ...
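The elided part of the function applies the ReLU, log, and max operations named in the docstring and then extracts the non-zero vocabulary indices and weights. A sketch of that continuation, assuming the standard SPLADE-style weighting:

    # Continuing sparse_doc_vectors after `output = doc_model(**tokens)`:
    logits, attention_mask = output.logits, tokens.attention_mask

    # SPLADE-style weighting: log(1 + ReLU(logits)), masked by attention, max over tokens.
    relu_log = torch.log(1 + torch.relu(logits))
    weighted_log = relu_log * attention_mask.unsqueeze(-1)
    tvecs, _ = torch.max(weighted_log, dim=1)

    # Keep only the non-zero vocabulary indices and their weights for each document.
    indices = []
    vecs = []
    for batch in tvecs:
        nz = batch.nonzero(as_tuple=True)[0]
        indices.append(nz.tolist())
        vecs.append(batch[nz].tolist())

    return indices, vecs

If you are using the Qdrant integration's hybrid mode, a helper like this is typically passed in as the sparse_doc_fn argument (with a matching query-side function as sparse_query_fn) when constructing QdrantVectorStore with enable_hybrid=True, while the dense vectors still come from the OpenAI embedding model.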