I run a Python server for data ingestion

At a glance

The community member runs a Python server for data ingestion and handling queries, and they are able to handle multiple concurrent document ingestion requests efficiently when using OpenAI embeddings. However, when they switch to using HuggingFace embeddings, the server can only process one request at a time, leaving the others stuck indefinitely.

The community members discuss using an embedding server like text-embedding-inference to handle the concurrency issues. They provide documentation on how to use this with LlamaIndex, and suggest running the server using Docker. There is also a discussion around the need to specify the model name both when running the Docker command and when instantiating TextEmbeddingsInference.

The community members also discuss the use of sparse vector generation with the "naver/efficient-splade-VI-BT-large-doc" model from Hugging Face, and whether this would also have similar concurrency issues. The response suggests that the model would only be able to process things sequentially, and that hosting the model on a server would be necessary to better handle requests.

I run a Python server for data ingestion and handling queries. The server handles multiple concurrent document ingestion requests efficiently when using:

embed_model = OpenAIEmbedding(embed_batch_size=50)

However, when I switch to using:

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

it can only process one request at a time, leaving the others stuck indefinitely.

Does anyone have advice on how to effectively use HuggingFace embeddings with LlamaIndex on a server that receives a high volume of concurrent requests?
11 comments
When you run a model locally, it is compute bound, which means it cannot handle requests concurrently. In fact, it cannot even do multiprocessing without creating a copy of the model

I suggest using an embedding server like text-embedding-inference
https://github.com/huggingface/text-embeddings-inference
lol I see someone else suggested the same
Got it!

Is there some documentation on how to use this with LlamaIndex?
basically, just need to use docker to run the server, and off you go
@Logan M TEI definitely speeds up RAG when using a local embedding model. Thank you! However, I'm curious why I have to specify model_name both when running the docker command and when instantiating TextEmbeddingsInference
Plain Text
docker run --gpus all -p 8080:80 -v /opt/tei:/data --pull always ghcr.io/huggingface/text-embeddings-inference:1.0 --model-id jinaai/jina-embeddings-v2-base-en

Plain Text
Settings.embed_model = TextEmbeddingsInference(model_name='jinaai/jina-embeddings-v2-base-en')
Mostly for tracing/observability purposes
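A minimal sketch putting the two pieces together, assuming the TEI container from the command above is running locally (the base_url and embed_batch_size values here are illustrative, not required):

Plain Text
from llama_index.core import Settings
from llama_index.embeddings.text_embeddings_inference import (
    TextEmbeddingsInference,
)

# Connect to the TEI container started above; the docker command maps host
# port 8080 to the container's port 80.
Settings.embed_model = TextEmbeddingsInference(
    model_name="jinaai/jina-embeddings-v2-base-en",  # used for tracing/observability
    base_url="http://127.0.0.1:8080",
    embed_batch_size=50,  # example value; tune for your workload
)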
@Logan M regarding this bit:

When you run a model locally, it is compute bound, which means it cannot handle requests concurrently. In fact, it cannot even do multiprocessing without creating a copy of the model

I suggest using an embedding server like text-embedding-inference
https://github.com/huggingface/text-embeddings-inference

I have a related question. I'm currently working on Hybrid Search (https://docs.llamaindex.ai/en/stable/examples/vector_stores/qdrant_hybrid.html) and the document says

This will run sparse vector generation locally using the "naver/efficient-splade-VI-BT-large-doc" model from Huggingface, in addition to generating dense vectors with OpenAI.

So it looks like the model is being run locally here as well. Do you foresee similar concurrency issues as above with this?
Yea, it would only be able to process things sequentially. You'd have to host the model on a server to better handle requests.

Thankfully that notebook shows how to customize the function that is calling the model
Thank you, Logan. I've reviewed the customisation code, but it's not clear how it can be customised to interact with a server. Does this have something equivalent to
Plain Text
from llama_index.embeddings.text_embeddings_inference import (
    TextEmbeddingsInference,
)

?
So you have a function that is generating sparse vectors

Plain Text
from typing import List, Tuple

import torch

# doc_tokenizer and doc_model are the SPLADE tokenizer and model
# ("naver/efficient-splade-VI-BT-large-doc") loaded earlier in the notebook.

def sparse_doc_vectors(
    texts: List[str],
) -> Tuple[List[List[int]], List[List[float]]]:
    """
    Computes vectors from logits and attention mask using ReLU, log, and max operations.
    """
    tokens = doc_tokenizer(
        texts, truncation=True, padding=True, return_tensors="pt"
    )
    if torch.cuda.is_available():
        tokens = tokens.to("cuda")

    output = doc_model(**tokens)
    ...


So you basically just have to replace the model/tokenizer calls with an API request to some server (it could be anything, I don't know if TEI supports sparse embeddings yet(?))
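For example, a hedged sketch of what that replacement could look like (the server URL, endpoint, and response format below are hypothetical; adapt them to whatever your sparse-embedding server actually exposes):

Plain Text
from typing import List, Tuple

import requests  # any HTTP client works; requests is just an example

# Hypothetical endpoint of a server you host for the SPLADE doc model.
SPARSE_EMBED_URL = "http://localhost:8081/embed_sparse"


def sparse_doc_vectors(
    texts: List[str],
) -> Tuple[List[List[int]], List[List[float]]]:
    """Fetch sparse vectors from a remote server instead of running the model locally."""
    resp = requests.post(SPARSE_EMBED_URL, json={"inputs": texts}, timeout=60)
    resp.raise_for_status()
    # Assumes the server returns one {"indices": [...], "values": [...]} object
    # per input text; adjust the parsing to your server's response format.
    data = resp.json()
    indices = [item["indices"] for item in data]
    values = [item["values"] for item in data]
    return indices, values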