Hi, I am loading GitHub repos into LlamaIndex and using GPTVectorStoreIndex for indexing. I am trying to get GPT to help me answer questions about the code. I am getting okay results, but they could be a lot better. It works well when the answer to a question is covered in the documentation, but for code that isn't documented it doesn't do a good job.
Would you happen to know if there is a better way of doing this?
I thought ListIndex might be better, but it takes 5+ minutes to get an answer and I am also running into context limit errors. I would probably have to use a larger model, for example one with a 32K context, but that would get quite expensive.
Hmm there shouldn't be context window errors, at least with default settings
In any case, I'm pretty sure the github reader doesn't actually load source code, only text files like markdown
Code is super tricky to work with tbh. You have to be careful to not chunk functions in half. LlamaIndex takes a lot of work to work well with code from what I've seen :thinking:
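To make the "don't chunk functions in half" point concrete, here is a minimal sketch, assuming the repo contains Python files and using only the standard-library `ast` module. `chunk_by_definition` is a hypothetical helper, not part of LlamaIndex; it returns one chunk per top-level function or class, and it skips module-level statements such as imports.

```python
import ast


def chunk_by_definition(source: str) -> list[str]:
    """Return one text chunk per top-level function or class in `source`.

    Hypothetical helper for illustration only (not a LlamaIndex API).
    Module-level statements such as imports are skipped in this sketch.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            # end_lineno (Python 3.8+) is the last line of the definition body.
            end = node.end_lineno
            chunks.append("\n".join(lines[start - 1:end]))
    return chunks


example = (
    "import os\n"
    "\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "\n"
    "class Greeter:\n"
    "    def hello(self):\n"
    "        return 'hi'\n"
)

chunks = chunk_by_definition(example)
for chunk in chunks:
    print(chunk)
    print("-" * 20)
```

Each chunk could then be wrapped in a document object and indexed separately, so a retrieved chunk always contains a whole function or class rather than an arbitrary slice of text.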
I guess I must have done something wrong then. Am I correct in believing that ListIndex would work best for this use case? If yes, I might have to give it another try.
Wait, github reader doesn't load source code files like .py, .js etc? Are you sure? It appears that it can do that from the image I sent. I haven't specified this parameter filter_file_extensions = ([".py"], GithubRepositoryReader.FilterType.INCLUDE) in my code when building the documents though, so I wonder if by default it reads only text files like you said?
I see, well I'm probably not smart enough yet to make it work haha
I still don't think a list index is a good application for this though. The ideal solution is probably a vector index, with some complicated custom retriever code lol