Hey! What would you recommend as the best way to index a website? I am using bs4 to crawl and format documents into a SimpleVectorStore, with ChatGPT as the llm_predictor, but the results are suboptimal for bigger websites: often the answer is not found in the context that ChatGPT receives.
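For reference, here is a minimal sketch of the pipeline I mean (the URLs are placeholders, and the `GPTSimpleVectorIndex` usage follows the early gpt_index API; adjust for whatever version you have installed):

```python
# Sketch: crawl pages with bs4, wrap them as Documents, and build
# a simple vector index. URLs are placeholders; GPTSimpleVectorIndex
# reflects early gpt_index releases and may differ in your version.
import requests
from bs4 import BeautifulSoup
from gpt_index import Document, GPTSimpleVectorIndex

urls = [
    "https://example.com/docs/page1",  # placeholder URLs
    "https://example.com/docs/page2",
]

documents = []
for url in urls:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Strip markup and keep only the visible text.
    text = soup.get_text(separator="\n", strip=True)
    documents.append(Document(text))

# Chunking and embedding happen when the index is built.
index = GPTSimpleVectorIndex(documents)

# At query time, the top-k matching chunks are stuffed into the
# LLM context -- which is where answers get lost on big sites.
response = index.query("How do I configure X?")
print(response)
```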
I am experimenting with something similar now. One thing I didn't expect: @dagthomas shared a post yesterday with the chunk size limit set to 600, and while I need to test a lot more, the smaller chunk size has been the most impactful parameter so far. I haven't tried the knowledge graph index yet, although I am about to. The playground code looks especially useful for testing different combinations: https://github.com/jerryjliu/gpt_index/blob/main/examples/playground/PlaygroundDemo.ipynb
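In case it helps, this is roughly how I'm setting the smaller chunk size and comparing indices. The `chunk_size_limit` kwarg matches gpt_index releases from around this time, and the `Playground` import path follows the linked demo notebook; both may have moved in newer versions:

```python
# Sketch: rebuild the index with a 600-token chunk limit, then use
# the Playground to run one query across several index types.
# chunk_size_limit and the Playground import path follow early
# gpt_index releases -- check your installed version.
from gpt_index import GPTSimpleVectorIndex
from gpt_index.playground import Playground

# `documents` is the list built during crawling, as in the sketch above.
index = GPTSimpleVectorIndex(documents, chunk_size_limit=600)

# The Playground runs the same query against every index you give it,
# so you can compare answers and costs side by side.
playground = Playground(indices=[index])
playground.compare("How do I configure X?")
```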
Keep in mind, a smaller chunk size can be more expensive, since embeddings have a fixed-size output. So if you stuff 4k tokens into an embedding versus 600, you need more embeddings for the same quantity of input data, and accordingly more vectors to store.
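A back-of-the-envelope version of that trade-off (the corpus size is illustrative; 1536 is the dimension of OpenAI's text-embedding-ada-002):

```python
# Smaller chunks -> more vectors to store for the same corpus,
# since every chunk produces one fixed-size embedding.
corpus_tokens = 1_000_000   # illustrative total tokens crawled
embedding_dim = 1536        # e.g. text-embedding-ada-002

for chunk_size in (4000, 600):
    n_vectors = -(-corpus_tokens // chunk_size)  # ceiling division
    n_floats = n_vectors * embedding_dim
    print(f"chunk_size={chunk_size}: {n_vectors} vectors, "
          f"{n_floats * 4 / 1e6:.1f} MB at float32")
# chunk_size=4000: 250 vectors, 1.5 MB at float32
# chunk_size=600: 1667 vectors, 10.2 MB at float32
```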
@adrianlee2220 I have been eager to try out the playground myself but haven't had a chance yet, and it may be a few days before I can. If you get the chance to share your experience or any pointers, I would be grateful!