hi! I'm looking for a way to index a collection of PDFs so that I can later run two-hop queries:
hop 1: select relevant PDFs based on a document description generated from each PDF's entire contents
hop 2: from the selected PDFs, select pages based on their descriptions
so both hops will have their lookup implemented using an LLM. if I were doing just hop 2, I think I could just use DocumentSummaryIndex. how hard would it be to set up this two-hop process?
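for reference, here is a minimal sketch of the two-hop flow described above, written in plain Python rather than against the LlamaIndex API. Everything in it is an assumption made for illustration: `PdfEntry`, `two_hop_query`, and `keyword_selector` are hypothetical names, and `keyword_selector` is a toy stand-in for the LLM-based lookup.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# A selector takes a query plus (id, description) pairs and returns the chosen ids.
# In a real setup this would be an LLM call; here it is a hypothetical stand-in.
Selector = Callable[[str, List[Tuple[str, str]]], List[str]]


@dataclass
class PdfEntry:
    """Hypothetical index entry: one description per PDF, one per page."""
    summary: str
    page_summaries: Dict[int, str] = field(default_factory=dict)


def keyword_selector(query: str, candidates: List[Tuple[str, str]]) -> List[str]:
    """Toy replacement for the LLM: keep candidates that share a word with the query."""
    words = set(query.lower().split())
    return [cid for cid, desc in candidates if words & set(desc.lower().split())]


def two_hop_query(
    query: str, index: Dict[str, PdfEntry], select: Selector
) -> Dict[str, List[int]]:
    """Return {pdf_id: [page numbers]} for the pages chosen in hop 2."""
    # Hop 1: choose PDFs by their whole-document descriptions.
    doc_ids = select(query, [(doc_id, e.summary) for doc_id, e in index.items()])

    # Hop 2: within each chosen PDF, choose pages by their page descriptions.
    results: Dict[str, List[int]] = {}
    for doc_id in doc_ids:
        pages = index[doc_id].page_summaries
        chosen = select(query, [(str(n), desc) for n, desc in pages.items()])
        results[doc_id] = [int(n) for n in chosen]
    return results
```

the point of the sketch is that both hops share one selector interface, so the same LLM prompt could serve both levels; only the descriptions passed in change.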
Maybe. However, I get the feeling that LlamaIndex has a DX gap to close, especially when I compare it to something like Astro.js. I know Astro is in a completely different and more mature space, but I would definitely look at Astro for inspiration when it comes to both abstractions and general DX.
if you have any actionable suggestions, would love to know. In general the library is extremely young -- always trying to improve, reduce tech debt, and make it easier for others to contribute
Yes, I keep in mind that LlamaIndex is pre-1.0. Two things come to my mind:
More high-level docs explaining the design of LlamaIndex, so one can build a mental model of how LlamaIndex approaches RAG and what to expect from it. Things I'd cover are: how information flows, how prompting works, how you approach customization of prompting, and what trade-offs and priorities you chose. See: https://docs.astro.build/en/concepts/why-astro/
Better tools for introspection. For example, it would be amazing if every abstraction like Node, Document, Index, etc. had a method introspection_guide() which would print, in a REPL/notebook, all the ways one can poke at an object and see what's inside. For now, I go to the source code and try to work out which properties to print to verify that e.g. DocumentSummaryIndex did what I think it did.
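To make the idea concrete, here is a rough sketch of what such a helper might look like as a standalone function. This is not an existing LlamaIndex method: `introspection_guide` is the hypothetical name from the suggestion above, and the sketch relies only on Python's built-in `dir` and `getattr`, so it works on any object.

```python
from typing import Any, Dict


def introspection_guide(obj: Any, max_len: int = 60) -> Dict[str, str]:
    """Print and return the public, non-callable attributes of obj.

    Hypothetical helper: each attribute is shown with its type name and a
    repr truncated to max_len characters.
    """
    guide: Dict[str, str] = {}
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip private and dunder attributes
        try:
            value = getattr(obj, name)
        except Exception:
            continue  # some properties raise when accessed; skip them
        if callable(value):
            continue  # methods are left out to keep the listing short
        text = repr(value)
        if len(text) > max_len:
            text = text[: max_len - 3] + "..."
        guide[name] = f"{type(value).__name__}: {text}"

    for name, description in guide.items():
        print(f"{name:<20} {description}")
    return guide
```

Calling `introspection_guide(some_object)` in a notebook would list the data attributes worth printing, which is roughly the step I currently do by reading the source.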