
Seeking Advice on Optimizing Index Creation Time for Ingestion Pipeline

I'm developing an ingestion pipeline using llama_index to process Paul Graham's essays. Despite successful data loading and node creation, the indexing phase with VectorStoreIndex is extremely slow, taking over two hours and still running.

Setup Overview:
  1. Data Loading:
from llama_index.core import SimpleDirectoryReader
reader = SimpleDirectoryReader(input_files=["data/paul_graham_essays.txt"])
docs = reader.load_data()

  2. Embedding Configuration:
     - LLM: llama3:instruct from Ollama.
     - Embedding: BAAI/bge-small-en-v1.5 with a batch size of 50.
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
Settings.llm = Ollama(model="llama3:instruct", request_timeout=90.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", embed_batch_size=50)

  3. Ingestion Pipeline:
     - Features: sentence splitting and title extraction.
import nest_asyncio
nest_asyncio.apply()
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
pipeline = IngestionPipeline(transformations=[SentenceSplitter(chunk_size=25), TitleExtractor()])
nodes = pipeline.run(documents=docs)

  4. Index Creation:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex(nodes)

Does anyone have insights on optimizing this indexing process to reduce time? Any advice or experiences with similar challenges would be greatly appreciated.
3 comments
  1. What is the size of the docs you are ingesting?
  2. Are you indexing on every run? Try persisting the indexed data so that on the second run you don't have to index again (see the persistence sketch below).
  3. If the size is large, I would recommend using a third-party vector store like Qdrant or Chroma (see the Chroma sketch below).
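Not from the thread itself, but a minimal sketch of the persist-and-reload pattern with llama_index's default local storage; the `./storage` directory name is an arbitrary choice for illustration:

from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage

# First run: build the index once and persist it to disk.
index = VectorStoreIndex(nodes)
index.storage_context.persist(persist_dir="./storage")

# Later runs: reload the index from disk instead of re-embedding everything.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)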
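Likewise, a minimal sketch of plugging Chroma in as the vector store, assuming the llama-index-vector-stores-chroma and chromadb packages are installed; the database path and collection name are placeholders:

import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Persistent local Chroma client; path and collection name are arbitrary.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("paul_graham_essays")

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Embeddings are written into Chroma as the index is built,
# so they survive across runs.
index = VectorStoreIndex(nodes, storage_context=storage_context)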
The documents I am ingesting total 3 MB, a single .txt file, and I am creating the index once, after processing all documents through the ingestion pipeline.
3 MB is very small; it should not take that much time, tbh.