Hybrid

iidontneedonetho

How would I speed up the part between the "Generating embeddings" steps? Right now it can take up to 15 min before the next set of embeddings is generated, which makes the whole process take up to 48 hours. This is using a hybrid Qdrant vector store setup. I'm on an SSD btw.

Plain Text

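import torch
import qdrant_client
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Run the dense embedding model on the GPU when one is available
device = "cuda" if torch.cuda.is_available() else "cpu"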
print("GPU available:", torch.cuda.is_available())
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", device=device)
#Settings.chunk_size = 512
qdrantclient = qdrant_client.QdrantClient(path="./qdrant_db")

'''DISCORD DATA'''
print("Loading local files...")
dir_path = 'DiscordDocs'
reader = SimpleDirectoryReader(input_dir=dir_path, required_exts=[".txt"])
discord_docs = reader.load_data()

print("Local files loaded successfully. Setting up vector store for Discord data...")
discord_vector_store = QdrantVectorStore(client=qdrantclient, enable_hybrid=True, batch_size=20, collection_name="discord-data")
discord_storage_context = StorageContext.from_defaults(vector_store=discord_vector_store)

discord_index = VectorStoreIndex.from_documents(discord_docs, storage_context=discord_storage_context, show_progress=True)
print("Discord data setup complete.")

Plain Text

GPU available: True
Loading local files...
Local files loaded successfully. Setting up vector store for Discord data...
Fetching 5 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]
Fetching 5 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:00<?, ?it/s]
Parsing nodes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 111/111 [03:59<00:00,  2.16s/it]
Generating embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2048/2048 [00:15<00:00, 131.86it/s]
Generating embeddings: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2048/2048 [00:13<00:00, 151.84it/s]

(I'm still generating embeddings right now)


LLogan M

By default, Qdrant hybrid uses a local model to generate sparse embeddings

LLogan M

If you don't have a GPU, this will be pretty slow

LLogan M

You can customize the function that generates sparse embeddings if you have a better option

iidontneedonetho

ah, I do have a 2080ti that I'm using for my normal embed model

iidontneedonetho

the actual Generating embeddings part with my gpu takes around 15 sec like in the post above. So is it generating the sparse embeddings after the normal embeddings?

iidontneedonetho

and that's just not being shown?

LLogan M

Yea the sparse embeddings get generated after, once vector_store.add() is called

iidontneedonetho

ah, is there a way to display that progress? and you said above I should look into customizing the function that generates sparse embeddings? If so, does hybrid not use gpu by default?

LLogan M

It should be using this
https://github.com/run-llama/llama_index/blob/aad4a6fb94c8fcaf1b7dfac56b88b9e277886bfe/llama-index-integrations/vector_stores/llama-index-vector-stores-qdrant/llama_index/vector_stores/qdrant/utils.py#L67

LLogan M

I'm actually not sure if fastembed supports gpu or not

LLogan M

But you can customize how this function is running

LLogan M

I don't think there's a progress bar here
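Since there is no built-in progress bar for this step, one workaround is to wrap whatever sparse encoder callable you pass to the vector store so that it logs each call. The sketch below is only an illustration: `with_progress` is a hypothetical helper (not part of LlamaIndex), and it assumes the store calls the encoder once per batch of texts and expects an `(indices, values)` tuple back, as the encoder posted in this thread does.

```python
import time
from typing import Callable, List, Tuple

# Shape used by the sparse encoders in this thread: (indices, values) per text.
BatchSparseEncoding = Tuple[List[List[int]], List[List[float]]]
SparseEncoder = Callable[[List[str]], BatchSparseEncoding]


def with_progress(
    encoder: SparseEncoder,
    total: int = 0,
    label: str = "Sparse embeddings",
) -> SparseEncoder:
    """Wrap a sparse encoder so every call prints a running count (hypothetical helper)."""
    done = 0
    start = time.perf_counter()  # clock starts when the wrapper is created

    def wrapped(texts: List[str]) -> BatchSparseEncoding:
        nonlocal done
        result = encoder(texts)  # delegate the real work unchanged
        done += len(texts)
        elapsed = time.perf_counter() - start
        suffix = f"/{total}" if total else ""
        print(f"{label}: {done}{suffix} texts after {elapsed:.1f}s", flush=True)
        return result

    return wrapped
```

The wrapped callable can be passed anywhere the original encoder would go. Note that the elapsed time is measured from when the wrapper was created, so it also includes any dense embedding time that happens before the first sparse call.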

iidontneedonetho

I'm using huggingface for my default embed model

LLogan M

Yea, this is completely unrelated/separate

iidontneedonetho

ah okay, I'll look into doing sparse indexing a different way to speed it up

iidontneedonetho

Also, I did see a fastembed-gpu version

iidontneedonetho

I'll see whether, if I install that manually, it will use that instead of fastembed

LLogan M

I thiiiink so

iidontneedonetho

testing now, will report back when I see results

iidontneedonetho

well, actually, is parsing nodes using fastembed? or any embedding model?

LLogan M

Mmm parsing nodes is just splitting text into chunks

iidontneedonetho

damn, I just looked and Parsing nodes hammers a single core and nothing else

LLogan M

Yea it's just string operations

iidontneedonetho

gotta get that multi threaded
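Because chunking is CPU-bound string work, the usual way to spread it across cores in Python is multiple processes rather than threads. The sketch below shows the idea with a naive stand-in splitter (`split_text` and `split_many` are hypothetical names, not LlamaIndex functions). The last function assumes LlamaIndex's `IngestionPipeline.run` accepts a `num_workers` argument, as its parallel-processing docs describe; check that against your installed version.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import List


def split_text(text: str, chunk_size: int = 512, overlap: int = 64) -> List[str]:
    """Naive fixed-width character splitter with overlap (stand-in for a real node parser)."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


def split_many(texts: List[str], workers: int = 4) -> List[List[str]]:
    """Split many documents in parallel, one list of chunks per input text.

    On Windows, call this from under `if __name__ == "__main__":`.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(split_text, texts, chunksize=8))


def parse_with_llamaindex(documents, workers: int = 4):
    """Assumed LlamaIndex route: let the ingestion pipeline fan out across processes."""
    from llama_index.core.ingestion import IngestionPipeline
    from llama_index.core.node_parser import SentenceSplitter

    pipeline = IngestionPipeline(transformations=[SentenceSplitter()])
    return pipeline.run(documents=documents, num_workers=workers)
```

This only helps the "Parsing nodes" step (about 4 minutes in the log above). It does nothing for the sparse embedding gap, which is the larger cost in this thread.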

iidontneedonetho

It's just hitting that single core really hard

Attachment

iidontneedonetho

I wonder if that's causing the bottleneck: the lack of multithreaded processing for these steps

iidontneedonetho

Finished the first embed (that spike in my GPU), and now it's doing the sparse embeddings, I'm guessing from what you said, which still don't seem to be running on the GPU

Attachment

iidontneedonetho

it's also pulsing a lot

iidontneedonetho

it's causing the CPU speed to fluctuate, which means it's processing slower, right?

iidontneedonetho

I'm gunna cancel that and do it again, this time, no hybrid, just straight vector store and see how that goes

iidontneedonetho

yah, the sparse node creation is extremely time consuming

iidontneedonetho

Gotta find a way to do it faster

iidontneedonetho

would it be possible to run the sparse embedding on the GPU too, if one is available?

iidontneedonetho

because fastembed-gpu is a package that uses the gpu
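fastembed runs its models through ONNX Runtime, so whether the sparse model can use the GPU comes down to whether ONNX Runtime exposes its CUDA execution provider, which needs the GPU build of `onnxruntime` that `fastembed-gpu` pulls in. A quick check, as a sketch: `onnxruntime.get_available_providers()` is a real ONNX Runtime call, while the two helper names below are made up for illustration.

```python
from typing import List, Sequence


def available_onnx_providers() -> List[str]:
    """Return the execution providers ONNX Runtime reports, or [] if it is not installed."""
    try:
        import onnxruntime
    except ImportError:
        return []
    return list(onnxruntime.get_available_providers())


def pick_providers(available: Sequence[str]) -> List[str]:
    """Prefer CUDA when it is reported as available, otherwise fall back to the CPU."""
    if "CUDAExecutionProvider" in available:
        return ["CUDAExecutionProvider", "CPUExecutionProvider"]
    return ["CPUExecutionProvider"]


if __name__ == "__main__":
    found = available_onnx_providers()
    print("ONNX Runtime providers:", found or "onnxruntime not installed")
    print("Would request:", pick_providers(found))
```

If `CUDAExecutionProvider` is missing from the output even after installing `fastembed-gpu`, a leftover CPU-only `onnxruntime` or `fastembed` install may be shadowing the GPU build; uninstalling those first is worth trying.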

iidontneedonetho

Actually, reading up a little bit, around line 23 of that utils.py, it seems like there are two options for generating sparse embeds: default_sparse_encoder and fastembed_sparse_encoder

iidontneedonetho

is there a way to choose which we use?

iidontneedonetho

or print which is being used?

iidontneedonetho

Looking at the class SparseTextEmbedding (line 46 of that utils.py), you could pull this into your own code and then install fastembed-gpu instead of fastembed. This might help with the speed at which these sparse embeddings are generated @Logan M

iidontneedonetho

Using GPT to spitball a code idea:

Plain Text

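from typing import List, Optional

# Type aliases defined alongside the original function in the Qdrant integration
from llama_index.vector_stores.qdrant.utils import BatchSparseEncoding, SparseEncoderCallable

def fastembed_sparse_encoder(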
    model_name: str = "prithvida/Splade_PP_en_v1",
    batch_size: int = 256,
    cache_dir: Optional[str] = None,
    threads: Optional[int] = None,
    device: Optional[str] = None,
) -> SparseEncoderCallable:
    try:
        from fastembed.sparse.sparse_text_embedding import SparseTextEmbedding
        import torch
    except ImportError as e:
        raise ImportError(
            "Could not import FastEmbed. "
            "Please install it with `pip install fastembed`"
        ) from e

    # torch is only used here to detect whether a CUDA device is present
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"

    # ONNX Runtime providers are passed as plain strings. The CUDA provider
    # is only usable with the GPU build (fastembed-gpu / onnxruntime-gpu).
    providers = ["CUDAExecutionProvider"] if device == "cuda" else ["CPUExecutionProvider"]

    model = SparseTextEmbedding(model_name, cache_dir=cache_dir, threads=threads, providers=providers)

    def compute_vectors(texts: List[str]) -> BatchSparseEncoding:
        embeddings = model.embed(texts, batch_size=batch_size)
        indices, values = zip(
            *[
                (embedding.indices.tolist(), embedding.values.tolist())
                for embedding in embeddings
            ]
        )
        return list(indices), list(values)

    return compute_vectors

LLogan M

You can customize the function used to encode sparse embeddings

There's an example here, you just pass in a callable

https://docs.llamaindex.ai/en/stable/examples/vector_stores/qdrant_hybrid/?h=qdrant+hybrid#customizing-sparse-vector-generation
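To make "just pass in a callable" concrete, here is a minimal sketch of the contract: a function that takes a list of texts and returns `(indices, values)`, one list of each per text. The encoder below is a deliberately trivial hashed term-count toy, not a replacement for SPLADE, and `toy_sparse_encoder` and `build_store` are hypothetical names. The `sparse_doc_fn` / `sparse_query_fn` arguments follow the linked docs example.

```python
import re
import zlib
from collections import Counter
from typing import List, Tuple

BatchSparseEncoding = Tuple[List[List[int]], List[List[float]]]


def toy_sparse_encoder(texts: List[str], dim: int = 2**20) -> BatchSparseEncoding:
    """Toy sparse encoder: hash each token to an index, use its count as the value."""
    all_indices: List[List[int]] = []
    all_values: List[List[float]] = []
    for text in texts:
        tokens = re.findall(r"\w+", text.lower())
        # crc32 is stable across runs, unlike Python's built-in hash() for strings
        counts = Counter(zlib.crc32(tok.encode("utf-8")) % dim for tok in tokens)
        indices = sorted(counts)
        all_indices.append(indices)
        all_values.append([float(counts[i]) for i in indices])
    return all_indices, all_values


def build_store(client, collection_name: str = "discord-data"):
    """Assumed wiring, following the linked docs: pass the callables to the store."""
    from llama_index.vector_stores.qdrant import QdrantVectorStore

    return QdrantVectorStore(
        client=client,
        collection_name=collection_name,
        enable_hybrid=True,
        sparse_doc_fn=toy_sparse_encoder,    # used when documents are added
        sparse_query_fn=toy_sparse_encoder,  # used at query time
    )
```

A GPU-backed encoder such as the fastembed one sketched earlier in the thread would slot into the same two arguments. Whatever encoder is used for documents should also be used (or matched) for queries, since the sparse indices have to line up.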
