How can I increase the speed of generating embeddings?

I am looking to see how I can increase the speed of generating embeddings; currently a large file takes several minutes. Should this be moved to a pipeline? This is running on AWS Lambda with LlamaIndex 0.9 (still working on the 0.10 upgrade). The embedding model is OpenAI text-embedding-3-large.

Plain Text
def add_nodes(self, nodes):
    return self.vector_store.add(nodes)

def add_nodes_from_file(
    self, tmpfile, external_id: str, node_parser: NodeParser, embedding: HuggingFaceEmbedding
):
    dir_reader = SimpleDirectoryReader(input_files=[tmpfile])
    docs = dir_reader.load_data()
    for doc in docs:
        doc.metadata["external_id"] = external_id

    nodes = node_parser.get_nodes_from_documents(docs)

    # one embedding API call per node -- this is the slow path
    for node in nodes:
        node_embeddings = embedding.get_text_embedding(
            node.get_content(metadata_mode="all")
        )
        node.embedding = node_embeddings

    res = self.add_nodes(nodes)
    print("Result from add nodes: " + str(res))
    return res
You are embedding one-by-one
Probably you should be doing batches
Plain Text
embeddings = embedding.get_text_embedding_batch(text_chunks)
You can also increase embed_batch_size on the embed model
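(As a minimal sketch of that, assuming the LlamaIndex 0.9 import path and an installed version that already knows the text-embedding-3 models:)

Plain Text
# Sketch: raise the batch size on the embed model
# (assumes the llama_index 0.9 import path for OpenAIEmbedding)
from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding(
    model="text-embedding-3-large",  # model named in the question
    embed_batch_size=256,            # OpenAI default is 100; larger = fewer API round trips
)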
Oh damn, ok, did not know that method exists. How does that batch align with the OpenAI embed_batch_size parameter?
embedding.get_text_embedding_batch(text_chunks) will take all your text chunks and embed them in batches of embed_batch_size

By default, I think it's 100 for OpenAI
Got it, yeah, I am tracing the code and I see it

Plain Text
            if idx == len(texts) - 1 or len(cur_batch) == self.embed_batch_size:
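(Paraphrasing that batching logic as a rough sketch, not the exact library source: texts accumulate in a current batch, which is flushed on the last index or when it reaches embed_batch_size:)

Plain Text
# Rough paraphrase of the library's batching loop, not the exact source
cur_batch, result_embeddings = [], []
for idx, text in enumerate(texts):
    cur_batch.append(text)
    if idx == len(texts) - 1 or len(cur_batch) == self.embed_batch_size:
        # flush: one API request embeds the whole batch
        result_embeddings.extend(self._get_text_embeddings(cur_batch))
        cur_batch = []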
Last question: how do you assign a List[Embedding] back to node.embedding, given that get_text_embedding() returns a single Embedding but get_text_embedding_batch() returns a List[Embedding]?
I have all the BaseNodes in nodes (a List[BaseNode]); I assume you need to pass an array of node.get_content() to the texts param
But then how do you assign the embeddings back to each node properly?
They come back in the same order you gave them
Plain Text
embeddings = embedding.get_text_embedding_batch(text_chunks)
for (embedding, node) in zip(embeddings, nodes):
  node.embedding = embedding
something like that
Got it.. and text_chunks would have to come from texts = [node.get_content() for node in nodes]
after nodes = node_parser.get_nodes_from_documents(docs)
Or just node.text, although that looks like it just calls get_content(MetadataMode.NONE)
I am worried about the decoupling of the text_chunks and the nodes
Plain Text
nodes = node_parser.get_nodes_from_documents(docs)
text_chunks = [node.get_content(metadata_mode="llm") for node in nodes]
embeddings = embedding.get_text_embedding_batch(text_chunks)

for (embedding, node) in zip(embeddings, nodes):
    node.embedding = embedding
Eh, it's fine. It's what the index does under the hood anyway
You are just doing it manually now
Ok, looks like my code above does something similar to line 142 there... thank you
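(Putting the thread together: a hedged sketch of the revised method, assuming the same class and imports as the original snippet; the loop variable is renamed to emb so it doesn't shadow the embedding model:)

Plain Text
def add_nodes_from_file(
    self, tmpfile, external_id: str, node_parser: NodeParser, embedding
):
    dir_reader = SimpleDirectoryReader(input_files=[tmpfile])
    docs = dir_reader.load_data()
    for doc in docs:
        doc.metadata["external_id"] = external_id

    nodes = node_parser.get_nodes_from_documents(docs)

    # one API call per embed_batch_size chunks instead of one call per node
    text_chunks = [node.get_content(metadata_mode="all") for node in nodes]
    embeddings = embedding.get_text_embedding_batch(text_chunks)

    # embeddings come back in the same order as text_chunks
    for emb, node in zip(embeddings, nodes):
        node.embedding = emb

    res = self.add_nodes(nodes)
    print("Result from add nodes: " + str(res))
    return res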