I am looking to see how I can increase the speed of generating embeddings

At a glance

The community member is looking to increase the speed of generating embeddings, which is currently taking several minutes for a large file. They are running this on AWS Lambda with LlamaIndex 0.9 and using OpenAI's text-embedding-3-large model. The community members point out that the current approach of embedding one-by-one should be changed to a batch approach using the get_text_embedding_batch() method. They also suggest increasing the embed_batch_size parameter on the embedding model. The community members provide sample code showing how to assign the batch of embeddings back to the corresponding nodes.

I am looking to see how I can increase the speed of generating embeddings; currently a large file is taking several minutes. Should this be moved to a pipeline? This is running on AWS Lambda with LlamaIndex 0.9 (still working on the 0.10 upgrade). The embedding model is OpenAI text-embedding-3-large.

Plain Text
    def add_nodes(self, nodes):
        return self.vector_store.add(nodes)

    def add_nodes_from_file(
        self, tmpfile, external_id: str, node_parser: NodeParser, embedding: HuggingFaceEmbedding
    ):
        dir_reader = SimpleDirectoryReader(input_files=[tmpfile])
        docs = dir_reader.load_data()
        for doc in docs:
            doc.metadata["external_id"] = external_id

        nodes = node_parser.get_nodes_from_documents(docs)

        # Embeds one node per API call; this loop is the slow part.
        for node in nodes:
            node_embeddings = embedding.get_text_embedding(
                node.get_content(metadata_mode="all")
            )
            node.embedding = node_embeddings

        res = self.add_nodes(nodes)
        print("Result from add nodes: " + str(res))
        return res
22 comments
You are embedding one-by-one.
Probably you should be doing it in batches:
Plain Text
embeddings = embedding.get_text_embedding_batch(text_chunks)
You can also increase embed_batch_size on the embed model
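For example, a minimal sketch of setting it when constructing the embed model (this assumes your 0.9.x release already supports the text-embedding-3-large model name, since that is what you mention using; the 256 value is just an illustration):

Plain Text
from llama_index.embeddings import OpenAIEmbedding

# Larger batches mean fewer round trips to the OpenAI embeddings endpoint.
embed_model = OpenAIEmbedding(
    model="text-embedding-3-large",
    embed_batch_size=256,
)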
Oh damn, ok, did not know that method exists. How does that batch align with the OpenAI embed_batch_size parameter?
embedding.get_text_embedding_batch(text_chunks) will take all your text chunks and embed in batches of embed_batch_size

By default, I think it's 100 for OpenAI.
Got it, yeah, I am tracing the code and I see it:

Plain Text
            if idx == len(texts) - 1 or len(cur_batch) == self.embed_batch_size:
Last question: how do you assign embeddings back to node.embedding, given that get_text_embedding() returns a single Embedding but get_text_embedding_batch() returns a List[Embedding]?
I have all the BaseNode objects in nodes (a List[BaseNode]); I assume you need to pass a list of node.get_content() strings as the texts param.
But then how do you assign the embedding back to each node properly?
They come back in the same order you gave them
Plain Text
embeddings = embedding.get_text_embedding_batch(text_chunks)
# embeddings[i] corresponds to text_chunks[i]
for emb, node in zip(embeddings, nodes):
    node.embedding = emb
something like that
Got it. And text_chunks would have to come from texts = [node.get_content() for node in nodes]
after nodes = node_parser.get_nodes_from_documents(docs)
Or just node.text, although that looks equivalent to get_content(metadata_mode=MetadataMode.NONE).
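For reference, a quick illustrative sketch of the difference (assuming LlamaIndex 0.9's MetadataMode enum from llama_index.schema):

Plain Text
from llama_index.schema import MetadataMode

node = nodes[0]
print(node.text)                                           # raw chunk text only
print(node.get_content(metadata_mode=MetadataMode.NONE))   # effectively the same as node.text
print(node.get_content(metadata_mode=MetadataMode.EMBED))  # text plus metadata marked for embedding
print(node.get_content(metadata_mode=MetadataMode.ALL))    # text plus all metadata, e.g. external_id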
I am worried about the decoupling of the text_chunks and the nodes.
Plain Text
nodes = node_parser.get_nodes_from_documents(docs)

# Build the chunks in node order so positions line up with the results.
text_chunks = [node.get_content(metadata_mode="llm") for node in nodes]

# One call; internally split into batches of embed_batch_size.
embeddings = embedding.get_text_embedding_batch(text_chunks)

for emb, node in zip(embeddings, nodes):
    node.embedding = emb
Eh, it's fine. It's what the index does under the hood anyways.
You are just doing it manually now
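For comparison, a rough sketch of letting the index handle the embedding step instead (this assumes LlamaIndex 0.9-style ServiceContext/StorageContext; embedding, nodes, and self.vector_store are the same objects used in the snippets above):

Plain Text
from llama_index import ServiceContext, StorageContext, VectorStoreIndex

service_context = ServiceContext.from_defaults(embed_model=embedding)
storage_context = StorageContext.from_defaults(vector_store=self.vector_store)

# VectorStoreIndex batch-embeds the nodes (honoring embed_batch_size)
# and writes them to the configured vector store.
index = VectorStoreIndex(
    nodes,
    service_context=service_context,
    storage_context=storage_context,
)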
OK, looks like my code above does something similar to line 142 there... thank you.