any thoughts on embeddings for a codebase? should that be a vector index for each file (with a generated summary for each), and a giant list index on top? or is there a better way to design the indices? the goal is to have Q&A with an LLM on specific code as well as high-level questions, e.g., "how do files X and Y implement Z?"
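Concretely, the design I have in mind would look something like the sketch below (assuming LlamaIndex's composability API; class names like GPTVectorStoreIndex / GPTListIndex / ComposableGraph vary across versions, and the per-file summaries here are just placeholders rather than real LLM-generated ones):

```python
from pathlib import Path

from llama_index import Document, GPTVectorStoreIndex, GPTListIndex
from llama_index.indices.composability import ComposableGraph

file_indices, file_summaries = [], []
for path in Path("my_repo").rglob("*.py"):  # hypothetical repo path
    code = path.read_text()
    # One vector index per file
    file_indices.append(GPTVectorStoreIndex.from_documents([Document(text=code)]))
    # Placeholder summary; in practice this would be an LLM-generated description
    file_summaries.append(f"Source code of {path}")

# List index composed on top of the per-file vector indices,
# routed by the per-file summaries
graph = ComposableGraph.from_indices(
    GPTListIndex, file_indices, index_summaries=file_summaries
)
query_engine = graph.as_query_engine()
print(query_engine.query("How do files X and Y implement Z?"))
```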
I don't know if vectorizing by file makes sense, since you'll likely need to chunk the data anyway. In my testing, chunk size has been one of the more powerful levers, and I've been getting better results from medium-sized chunks than from larger ones. I'm still new to doing this with codebases, so I can't say what's best, but a couple of pointers. First, make sure all of your code files are well annotated; if they aren't, use GPT to fully annotate them first. Second, choose a text splitter that works well with code. LangChain has a lot of different splitters, and you may get different results testing them, but a good general-purpose one for code is RecursiveCharacterTextSplitter: it tries to split on double newlines first and only falls back to finer separators when a piece is still too large, so it won't break up your functions as long as you don't leave a bunch of random blank lines inside them.
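For the chunking side, here's a minimal sketch with LangChain's RecursiveCharacterTextSplitter (the chunk_size / chunk_overlap values and the file path are just illustrative starting points to tune):

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

source_code = open("my_module.py").read()  # hypothetical file

# Default separators: try "\n\n" first, then "\n", then spaces, then characters,
# so functions usually stay intact unless a single block exceeds chunk_size.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=1000,    # medium-sized chunks; this is the lever worth experimenting with
    chunk_overlap=100,  # small overlap so context isn't lost at chunk boundaries
)
chunks = splitter.split_text(source_code)
print(len(chunks), chunks[0][:200])
```

Worth noting that newer LangChain versions also expose RecursiveCharacterTextSplitter.from_language(language=Language.PYTHON, ...), which adds language-specific separators (class/function boundaries), so that's another splitter to include when you're comparing results.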