How can I increase the speed of generating embeddings?

I am looking to see how I can increase the speed of generating embeddings; currently a large file takes several minutes. Should this be moved to a pipeline? This is running on AWS Lambda with LlamaIndex 0.9 (still working on the 0.10 upgrade). The embedding model is OpenAI text-embedding-3-large.

Plain Text
def add_nodes(self, nodes):
    return self.vector_store.add(nodes)

def add_nodes_from_file(
    self, tmpfile, external_id: str, node_parser: NodeParser, embedding: HuggingFaceEmbedding
):
    dir_reader = SimpleDirectoryReader(input_files=[tmpfile])
    docs = dir_reader.load_data()
    for doc in docs:
        doc.metadata["external_id"] = external_id

    nodes = node_parser.get_nodes_from_documents(docs)

    # one embedding API call per node -- this is the slow path
    for node in nodes:
        node_embeddings = embedding.get_text_embedding(
            node.get_content(metadata_mode="all")
        )
        node.embedding = node_embeddings

    res = self.add_nodes(nodes)
    print("Result from add nodes: " + str(res))
    return res
You are embedding one-by-one
Probably you should be doing batches
Plain Text
embeddings = embedding.get_text_embedding_batch(text_chunks)
You can also increase embed_batch_size on the embed model
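(As a minimal sketch of that, assuming the LlamaIndex 0.9 import path and an installed version that already knows the text-embedding-3 models:)

Plain Text
# Sketch: raise the batch size on the embed model
# (assumes the llama_index 0.9 import path for OpenAIEmbedding)
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding(
    model="text-embedding-3-large",  # model named in the question
    embed_batch_size=256,            # OpenAI default is 100; larger = fewer API round trips
)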
Oh damn, ok, did not know that method exists. How does that batch align with the OpenAI embed_batch_size parameter?
embedding.get_text_embedding_batch(text_chunks) will take all your text chunks and embed them in batches of embed_batch_size

By default, I think it's 100 for OpenAI
Got it, yeah, I am tracing the code and I see it

Plain Text
            if idx == len(texts) - 1 or len(cur_batch) == self.embed_batch_size:
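(Paraphrasing that batching logic as a rough sketch, not the exact library source: texts accumulate in a current batch, which is flushed on the last index or when it reaches embed_batch_size:)

Plain Text
# Rough paraphrase of the library's batching loop, not the exact source
cur_batch, result_embeddings = [], []
for idx, text in enumerate(texts):
    cur_batch.append(text)
    if idx == len(texts) - 1 or len(cur_batch) == self.embed_batch_size:
        # flush: one API request embeds the whole batch
        result_embeddings.extend(self._get_text_embeddings(cur_batch))
        cur_batch = []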
Last question: how do you assign a List[Embedding] back to node.embedding, given that get_text_embedding() returns a single Embedding but get_text_embedding_batch() returns a List[Embedding]?
I have all the BaseNodes in nodes (a List[BaseNode]); I assume you need to pass an array of node.get_content() to the texts param
But then how do you assign the embeddings back to each node properly?
They come back in the same order you gave them
Plain Text
embeddings = embedding.get_text_embedding_batch(text_chunks)
for (embedding, node) in zip(embeddings, nodes):
  node.embedding = embedding
something like that
Got it.. and text_chunks would have to come from texts = [node.get_content() for node in nodes]
after nodes = node_parser.get_nodes_from_documents(docs)
Or just node.text, although that looks like it just calls get_content(MetadataMode.NONE)
I am worried about the decoupling of the text_chunks and the nodes
Plain Text
nodes = node_parser.get_nodes_from_documents(docs)
text_chunks = [node.get_content(metadata_mode="llm") for node in nodes]
embeddings = embedding.get_text_embedding_batch(text_chunks)

for (embedding, node) in zip(embeddings, nodes):
    node.embedding = embedding
Eh, it's fine. It's what the index does under the hood anyway
You are just doing it manually now
Ok, looks like my code above does something similar to line 142 there... thank you
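(Putting the thread together: a hedged sketch of the revised method, assuming the same class and imports as the original snippet; the loop variable is renamed to emb so it doesn't shadow the embedding model:)

Plain Text
def add_nodes_from_file(
    self, tmpfile, external_id: str, node_parser: NodeParser, embedding
):
    dir_reader = SimpleDirectoryReader(input_files=[tmpfile])
    docs = dir_reader.load_data()
    for doc in docs:
        doc.metadata["external_id"] = external_id

    nodes = node_parser.get_nodes_from_documents(docs)

    # one API call per embed_batch_size chunks instead of one call per node
    text_chunks = [node.get_content(metadata_mode="all") for node in nodes]
    embeddings = embedding.get_text_embedding_batch(text_chunks)

    # embeddings come back in the same order as text_chunks
    for emb, node in zip(embeddings, nodes):
        node.embedding = emb

    res = self.add_nodes(nodes)
    print("Result from add nodes: " + str(res))
    return res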