
Updated 3 months ago

I recognize that any decent answer will have plenty of qualifiers, but what do you all use as a "~80% good enough" starting point for building an index, when time to index is not a constraint? It's tempting to use fancy chunkers and add in all the extractors. And I know the real answer is to experiment and evaluate what's best for your dataset. But like, is there any "general" recommendation or go-to opinion for what to try after the basic SimpleDirectoryReader + VectorStoreIndex?
5 comments
Every tweet from @jerryjliu0 makes me feel like I need to be trying another RAG pipeline setup 😂
lol I mean, my go-to so far has literally been mostly defaults, with most of the effort going into the retrieval side (hybrid search, filtering, query rewriting, reranking)

Probably the most interesting augmentation to ingestion is either semantic chunking, or ingesting stuff with llama-parse so that you get either spatial text or markdown text
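For anyone who hasn't seen semantic chunking before, here's a rough sketch of the idea: start a new chunk wherever adjacent sentences stop looking similar. This is a simplified fixed-threshold version, not LlamaIndex's actual splitter. The names `semantic_chunks`, `embed`, and `threshold` are made up for illustration, and `embed` stands in for whatever embedding model you use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_chunks(sentences, embed, threshold=0.5):
    """Group consecutive sentences, breaking where adjacent similarity drops.

    `embed` is a hypothetical callable mapping a sentence to a vector;
    `threshold` is an assumed cutoff you would tune on your own data.
    """
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    # Compare each sentence with the one immediately before it.
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            chunks.append([sentence])  # similarity dropped: start a new chunk
        else:
            chunks[-1].append(sentence)  # still on topic: extend current chunk
    return [" ".join(chunk) for chunk in chunks]
```

Real implementations usually pick the breakpoint from the distribution of similarities (for example a percentile) rather than a hard-coded cutoff, so treat the threshold here as a placeholder.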
on the retrieval side, what is the first thing you tune? there are so many knobs between hybrid, fusion with vector + bm25, query rewrites, agentic retrieval, etc. it feels like alchemy
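To make the "fusion with vector + bm25" knob a bit less alchemical, here's a minimal sketch of reciprocal rank fusion, one common way to merge two ranked lists. It only needs the rank positions, not comparable scores. The function name, the doc ids, and `k=60` (a commonly used default) are assumptions for illustration, not anything specific to a given library.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by doc id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a vector retriever and a BM25 retriever.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Documents that show up in both lists float to the top, which is usually the behavior you want from hybrid search before adding a reranker.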
it makes me want to parameterize it all and slap a multi-armed bandit on top
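If someone did want to try the bandit idea, a minimal epsilon-greedy sketch might look like the following. Each "arm" is a pipeline configuration and the reward is whatever eval score you trust per query. The class name, the arm names in the usage note, and the reward scale are all hypothetical; this is one simple bandit strategy, not a recommendation over proper offline evaluation.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy selection over a fixed set of arms (e.g. pipeline configs)."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in self.arms}
        self.means = {arm: 0.0 for arm in self.arms}
        self.rng = random.Random(seed)

    def select(self):
        """Pick the next arm to try."""
        # Try every arm once before trusting the running averages.
        untried = [arm for arm in self.arms if self.counts[arm] == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise exploit the best mean.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.means[arm])

    def update(self, arm, reward):
        """Record an observed reward using an incremental mean."""
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

Usage would be a loop along the lines of `arm = bandit.select()`, run the query through that configuration (say `"vector_only"` vs `"hybrid_rrf"`), score the answer, then `bandit.update(arm, score)`. The hard part in practice is getting a reward signal you believe, not the bandit itself.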
On the ingestion side, I really only do SemanticParsing, which works great for me. Otherwise, agreed w/Logan. Defaults seemed to work. My real advancements came with (1) understanding the pipeline better and (2) understanding and building my own multi-agent, multi-stage pipeline instead. 65% of my other time has been on the data model.