
Updated 7 months ago

I recognize that any decent answer will have plenty of qualifiers, but what do you all use as a "~80% good enough" starting point for building an index, when time to index is not a constraint? It's tempting to use fancy chunkers and add in all the extractors. And I know the real answer is to experiment and evaluate what's best for your dataset. But is there any "general" recommendation or opinionated go-to for what to try after the basic SimpleDirectoryReader + VectorStoreIndex?

5 comments

joey

Every tweet from @jerryjliu0 makes me feel like I need to be trying another RAG pipeline setup 😂

Logan M

lol I mean, my go-to so far has literally been mostly defaults, with most of the effort going into the retrieval side (hybrid search, filtering, query rewriting, reranking)

Probably the most interesting augmentation to ingestion is either semantic chunking, or ingesting stuff with llama-parse so that you get either spatial text or markdown text
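
For anyone who wants the idea behind semantic chunking spelled out, here is a minimal sketch: embed each sentence and start a new chunk wherever similarity to the previous sentence drops below a threshold. It is not how any particular library implements it. `toy_embed` is a bag-of-words stand-in for a real embedding model, and the function names and the `threshold` value are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding (assumption): bag-of-words counts.
    A real pipeline would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(text, threshold=0.2, embed=toy_embed):
    """Group consecutive sentences; break when similarity to the
    previous sentence falls below `threshold` (hypothetical value)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    prev = embed(sentences[0])
    for sentence in sentences[1:]:
        cur = embed(sentence)
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])    # topic shift: start a new chunk
        else:
            chunks[-1].append(sentence)  # same topic: extend current chunk
        prev = cur
    return [" ".join(chunk) for chunk in chunks]
```

With a real embedding model the threshold usually needs tuning per dataset, which is part of why defaults plus evaluation tends to be the practical starting point.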

joey

on the retrieval side, what is the first thing you tune? there are so many knobs between hybrid, fusion with vector + bm25, query rewrites, agentic retrieval, etc. it feels like alchemy
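
To make "fusion with vector + bm25" concrete, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge two ranked lists without having to calibrate their scores against each other. The document ids and the two hit lists are made up for illustration; `k=60` is a commonly used default, not something taken from this thread.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids.
    Each list contributes 1 / (k + rank) per document, with rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a vector retriever and a BM25 retriever.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that appear near the top of both lists (here `doc_b` and `doc_a`) rise above documents that only one retriever found.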

joey

it makes me want to parameterize it all and slap a multi-arm bandit on top
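
A rough sketch of what that could look like: an epsilon-greedy bandit where each arm is a named pipeline configuration and the reward is whatever evaluation score you trust. The class, the `run_trials` helper, the arm names, and the `evaluate` callback are all hypothetical; nothing here comes from a library API.

```python
import random


class EpsilonGreedyBandit:
    """Epsilon-greedy multi-armed bandit over named pipeline configs."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in self.arms}
        self.means = {arm: 0.0 for arm in self.arms}
        self.rng = random.Random(seed)

    def select(self):
        # Try every arm once before trusting any estimate.
        for arm in self.arms:
            if self.counts[arm] == 0:
                return arm
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)  # explore
        return max(self.arms, key=lambda a: self.means[a])  # exploit

    def update(self, arm, reward):
        # Incremental running mean of the observed rewards.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

    def best(self):
        return max(self.arms, key=lambda a: self.means[a])


def run_trials(bandit, evaluate, rounds):
    """`evaluate(arm)` is a hypothetical callback returning a score,
    e.g. answer quality on a sampled eval question."""
    for _ in range(rounds):
        arm = bandit.select()
        bandit.update(arm, evaluate(arm))
    return bandit.best()
```

In practice the hard part is the reward signal rather than the bandit itself: each pull costs a real query plus some way to score the answer.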

JasonV

On the ingestion side, I really only do SemanticParsing, which works great for me. Otherwise, agreed w/Logan. Defaults seemed to work. My real advancements came with (1) understanding the pipeline better and (2) understanding and building my own multi-agent, multi-stage pipeline instead. 65% of my other time has been on the data model.
