
Hi!

I'm trying to build an advanced RAG application, and I'm running into an issue during the ingestion step.

  • I'm trying to embed a large CSV file row by row. However, the file is quite big (~95 MB, ~440,000 rows) and my embedding speed is roughly 2048 rows per 50 seconds, so it takes way too long to embed everything.
Is there a way to improve the performance, i.e. embed faster?
Here's the Python code:
Plain Text
import os
from pathlib import Path

from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.readers.file import CSVReader

if not os.path.exists("vector_index"):
    logger.info("CSVReader working..")
    reader = CSVReader(concat_rows=False)
    nodes = reader.load_data(file=Path(path))
    logger.info("CSVReader done..")

    logger.info("Vectorizing..")
    index = VectorStoreIndex(nodes, show_progress=True)
    logger.info("Vectorizing done..")
    logger.info("Storing..")
    index.storage_context.persist("vector_index")
    logger.info("Storing done..")
else:
    logger.info("Loading from storage..")
    storage_context = StorageContext.from_defaults(persist_dir="vector_index")
    index = load_index_from_storage(storage_context)
2 comments
You can try increasing the embed_batch_size in your embed_model.

Plain Text
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(embed_batch_size=500)

The default value is 100.
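
One way to wire that into the ingestion step above — a minimal sketch, assuming a recent llama_index version where the Settings object is available, and reusing the nodes loaded by CSVReader in your snippet — could look like:

Plain Text
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

# Larger batches mean fewer embedding API round-trips for the same number of rows
Settings.embed_model = OpenAIEmbedding(embed_batch_size=500)

# The index picks up the embed_model from Settings
index = VectorStoreIndex(nodes, show_progress=True)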
Alright - will try it 😮