Seeking Advice on Optimizing Index Creation Time for Ingestion Pipeline
I'm developing an ingestion pipeline using llama_index to process Paul Graham's essays. Data loading and node creation complete without issue, but the indexing phase with VectorStoreIndex is extremely slow: it has been running for over two hours and still hasn't finished.
Setup Overview:
- Data Loading:
from llama_index.core import SimpleDirectoryReader
reader = SimpleDirectoryReader(input_files=["data/paul_graham_essays.txt"])
docs = reader.load_data()
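Loading itself is not the slow part; as a quick sanity check on input size I print the document count and text length (pure diagnostics, not part of the pipeline):
# Quick check: a single input file should come back as one Document
print(len(docs), "docs;", len(docs[0].text), "chars in the first")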
- Embedding Configuration:
- LLM: llama3:instruct from Ollama.
- Embedding: BAAI/bge-small-en-v1.5 with batch size 50.
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
Settings.llm = Ollama(model="llama3:instruct", request_timeout=90.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", embed_batch_size=50)
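To check whether the embedding model alone is the bottleneck, I'm thinking of timing one batch in isolation; a minimal sketch (the dummy texts are arbitrary placeholders):
import time
texts = ["a short test sentence"] * 50  # one batch at my configured batch size
start = time.perf_counter()
Settings.embed_model.get_text_embedding_batch(texts)
print(f"one batch of 50: {time.perf_counter() - start:.2f}s")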
- Ingestion Pipeline:
- Features: Sentence splitting and title extraction.
import nest_asyncio
nest_asyncio.apply()
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
# 25-token chunks produce a very large number of nodes for a full essay collection
pipeline = IngestionPipeline(transformations=[SentenceSplitter(chunk_size=25), TitleExtractor()])
nodes = pipeline.run(documents=docs)
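For visibility into where the time goes, I plan to re-run the pipeline with a progress bar and parallel workers (both are stock IngestionPipeline.run options, though I haven't confirmed num_workers helps when the extractor is calling Ollama):
nodes = pipeline.run(documents=docs, show_progress=True, num_workers=4)
print(len(nodes), "nodes produced")  # with 25-token chunks this count gets very large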
- Index Creation:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex(nodes)
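A variant I intend to try so the build at least reports progress while embedding (show_progress is a standard kwarg on the index constructor):
# Same build, but with a progress bar over the embedding batches
index = VectorStoreIndex(nodes, show_progress=True)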
Does anyone have insights into speeding up this indexing step? Any advice or experience with similar setups would be greatly appreciated.