Hi!
I'm trying to build an advanced RAG application, and I've hit an issue during the ingestion step.
- I'm embedding a large CSV file row by row. The file is big (~95 MB, ~440,000 rows) and my embedding speed is roughly 2,048 rows per 50 seconds, so embedding everything takes far too long.
Is there a way to improve the performance, i.e. embed faster? (One idea I've been considering is sketched after the code below.)
Here's the Python code:
```python
import os
from pathlib import Path

# Imports assume the llama-index >= 0.10 package layout.
from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage
from llama_index.readers.file import CSVReader

if not os.path.exists("vector_index"):
    # One node per CSV row (concat_rows=False).
    logger.info("CSVReader working..")
    reader = CSVReader(concat_rows=False)
    nodes = reader.load_data(file=Path(path))
    logger.info("CSVReader done..")

    # Embed every node and build the vector index.
    logger.info("Vectorizing..")
    index = VectorStoreIndex(nodes, show_progress=True)
    logger.info("Vectorizing done..")

    # Persist the index to disk so it can be reloaded later.
    logger.info("Storing..")
    index.storage_context.persist("vector_index")
    logger.info("Storing done..")
else:
    # Reload the persisted index instead of re-embedding everything.
    logger.info("Loading from the storage..")
    storage_context = StorageContext.from_defaults(persist_dir="vector_index")
    index = load_index_from_storage(storage_context)
```
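
One thing I've been wondering about is whether raising the embedding batch size would help, so the model gets fewer, larger requests instead of many small ones. Here's a rough sketch of what I mean; it assumes the llama-index >= 0.10 `Settings` API and an OpenAI embedding model, so the exact names and values here are my assumptions rather than something I've verified fixes the problem:

```python
from llama_index.core import Settings, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding

# Assumption: OpenAI embeddings; swap in whichever embed model you actually use.
# embed_batch_size controls how many texts are sent per embedding request
# (the default is fairly small), so a larger value means fewer round trips.
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=256,
)

# VectorStoreIndex picks up Settings.embed_model when embedding the nodes.
index = VectorStoreIndex(nodes, show_progress=True)
```

Would that be the right direction, or is there a better way to speed this up (e.g. a local embedding model on GPU, or parallelizing the ingestion)?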