
Hi!

I'm trying to build an advanced RAG application, and I'm running into an issue during the ingestion step.

  • I'm trying to embed a large CSV file row by row. However, the file is quite big (~95 MB, ~440,000 rows) and my embedding speed is roughly 2048 rows per 50 seconds, so it takes way too long to embed everything.
Is there a way to improve the performance, i.e. embed faster?
Here's the Python code:
Plain Text
import os
from pathlib import Path

from llama_index.core import VectorStoreIndex, StorageContext, load_index_from_storage
from llama_index.readers.file import CSVReader

if not os.path.exists("vector_index"):
    logger.info("CSVReader working..")
    reader = CSVReader(concat_rows=False)
    nodes = reader.load_data(file=Path(path))
    logger.info("CSVReader done..")

    logger.info("Vectorizing..")
    index = VectorStoreIndex(nodes, show_progress=True)
    logger.info("Vectorizing done..")
    logger.info("Storing..")
    index.storage_context.persist("vector_index")
    logger.info("Storing done..")
else:
    logger.info("Loading from storage..")
    storage_context = StorageContext.from_defaults(persist_dir="vector_index")
    index = load_index_from_storage(storage_context)
2 comments
You can try increasing the embed_batch_size in your embed_model.

Plain Text
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(embed_batch_size=500)

The default value is 100.
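
One way to wire that into the ingestion step above — a minimal sketch, assuming a recent llama_index version where the Settings object is available, and reusing the nodes loaded by CSVReader in your snippet — could look like:

Plain Text
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

# Larger batches mean fewer embedding API round-trips for the same number of rows
Settings.embed_model = OpenAIEmbedding(embed_batch_size=500)

# The index picks up the embed_model from Settings
index = VectorStoreIndex(nodes, show_progress=True)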
Alright - will try it 😮