Thanks for your input! 95% of my documents are PowerPoints, so I was planning on chunking slide by slide and generating one embedding per slide. Is that the same concept as using sentence transformers?
My main question, though, is what the GPT-Index index structure should be. Because of the vast amount of data, would I need to go in a multi-level tree direction? Would that hinder performance?
I think ANN (approximate nearest neighbor) search is pretty good for these vector stores. You should try the "naive" approach first and then refine as necessary. Simple is always easier to maintain :)
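For what it's worth, here is a rough sketch of what the "naive" approach could look like: one vector per slide in a flat list, with no tree, searched by plain cosine similarity. Everything in it is an assumption for illustration. `toy_embed` is a hypothetical word-count stand-in for a real embedding model (such as a sentence-transformers model), the slide IDs and texts are made up, and none of this is GPT-Index's actual API.

```python
import math


def build_vocab(texts):
    """Map every word seen in the slides to a fixed vector position."""
    words = sorted({w for t in texts for w in t.lower().split()})
    return {w: i for i, w in enumerate(words)}


def toy_embed(text, vocab):
    """Hypothetical stand-in for a real embedding model:
    a word-count vector over the vocabulary, L2-normalized."""
    vec = [0.0] * len(vocab)
    for w in text.lower().split():
        if w in vocab:
            vec[vocab[w]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_flat_index(slides, vocab):
    """One (slide_id, vector) entry per slide: a flat list, no hierarchy."""
    return [(slide_id, toy_embed(text, vocab)) for slide_id, text in slides]


def search(index, query, vocab, top_k=3):
    """Exact brute-force scan. For unit vectors the dot product
    equals cosine similarity."""
    q = toy_embed(query, vocab)
    scored = [(sum(a * b for a, b in zip(q, vec)), slide_id)
              for slide_id, vec in index]
    scored.sort(reverse=True)
    return [(slide_id, score) for score, slide_id in scored[:top_k]]


# Made-up slides: each slide is one chunk and gets one embedding.
slides = [
    ("deck1-slide1", "quarterly revenue grew in the enterprise segment"),
    ("deck1-slide2", "hiring plan for the engineering team"),
    ("deck2-slide1", "roadmap for the mobile app redesign"),
]
vocab = build_vocab(text for _, text in slides)
index = build_flat_index(slides, vocab)
results = search(index, "enterprise revenue", vocab, top_k=1)
print(results[0][0])
```

The scan here is exact, so it costs one dot product per slide per query. If that gets too slow as the collection grows, the `search` step is the part you would swap for an ANN index, and the per-slide chunking can stay the same.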