
Seeking Advice on Optimizing Index Creation Time for Ingestion Pipeline

I'm developing an ingestion pipeline using llama_index to process Paul Graham's essays. Despite successful data loading and node creation, the indexing phase with VectorStoreIndex is extremely slow, taking over two hours and still running.

Setup Overview:
  1. Data Loading:
from llama_index.core import SimpleDirectoryReader
reader = SimpleDirectoryReader(input_files=["data/paul_graham_essays.txt"])
docs = reader.load_data()

  2. Embedding Configuration:
     - LLM: llama3:instruct from Ollama.
     - Embedding: BAAI/bge-small-en-v1.5 with a batch size of 50.
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
Settings.llm = Ollama(model="llama3:instruct", request_timeout=90.0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", embed_batch_size=50)

  3. Ingestion Pipeline:
     - Features: sentence splitting and title extraction.
import nest_asyncio
nest_asyncio.apply()
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.extractors import TitleExtractor
pipeline = IngestionPipeline(transformations=[SentenceSplitter(chunk_size=25), TitleExtractor()])
nodes = pipeline.run(documents=docs)

  4. Index Creation:
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex(nodes)

Does anyone have insights on optimizing this indexing process to reduce time? Any advice or experiences with similar challenges would be greatly appreciated.
3 comments
  1. What is the size of the docs you are ingesting?
  2. Are you indexing on every run? Try persisting the indexed data so that on the second run you don't have to index again (see the persistence sketch below).
  3. If the size is large, I would recommend using a third-party vector store like Qdrant or Chroma (see the Chroma sketch below).
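Not from the thread itself, but a minimal sketch of the persist-and-reload pattern with llama_index's default local storage; the `./storage` directory name is an arbitrary choice for illustration:

from llama_index.core import StorageContext, VectorStoreIndex, load_index_from_storage

# First run: build the index once and persist it to disk.
index = VectorStoreIndex(nodes)
index.storage_context.persist(persist_dir="./storage")

# Later runs: reload the index from disk instead of re-embedding everything.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)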
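Likewise, a minimal sketch of plugging Chroma in as the vector store, assuming the llama-index-vector-stores-chroma and chromadb packages are installed; the database path and collection name are placeholders:

import chromadb
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.vector_stores.chroma import ChromaVectorStore

# Persistent local Chroma client; path and collection name are arbitrary.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("paul_graham_essays")

vector_store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Embeddings are written into Chroma as the index is built,
# so they survive across runs.
index = VectorStoreIndex(nodes, storage_context=storage_context)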
The documents I am ingesting total 3 MB, a single .txt file, and I am creating the index once, after processing all documents through the ingestion pipeline.
3 MB is very small; it should not take that much time, tbh.