The community member is seeking recommendations for the best way to index a website. They are using BeautifulSoup (bs4) to crawl pages and load the resulting documents into a SimpleVectorStore, with ChatGPT as the llm_predictor, but the results are suboptimal for larger websites: the answer is often not present in the context that ChatGPT receives.
The comments suggest that the community member use the latest versions of llama_index and langchain, as they include some ChatGPT-specific improvements. They also recommend increasing the similarity_top_k parameter in the query call so that more of the context retrieved from the embeddings is passed to the model.
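A minimal sketch of that suggestion, assuming the index.query API and its similarity_top_k keyword from the llama_index versions discussed in this thread; the index file name and the question are hypothetical:

```python
# Minimal sketch: raise similarity_top_k so more retrieved chunks are passed
# to ChatGPT. Assumes a GPTSimpleVectorIndex previously saved to disk; the
# file name and the question below are hypothetical.
from llama_index import GPTSimpleVectorIndex

index = GPTSimpleVectorIndex.load_from_disk("site_index.json")

# By default only the single closest chunk is retrieved; similarity_top_k=3
# loads the three closest chunks into the prompt instead.
response = index.query(
    "How do I configure the crawler?",
    similarity_top_k=3,
)
print(response)
```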
The community members discuss experimenting with smaller chunk sizes, which has been the most impactful parameter so far. They also mention trying out the knowledge graph index and the playground code, which is said to be useful for testing different combinations.
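A sketch of what a smaller chunk size looks like at index-construction time, assuming the chunk_size_limit keyword accepted by the index constructors of that era; the input folder and output file name are hypothetical:

```python
# Sketch of building the vector index with a smaller chunk size. Assumes the
# chunk_size_limit keyword from the llama_index versions in this thread.
from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader

# Any documents will do for the illustration; here they are loaded from a
# local folder (hypothetical path) rather than from a live crawl.
documents = SimpleDirectoryReader("crawled_pages").load_data()

index = GPTSimpleVectorIndex(
    documents,
    chunk_size_limit=600,  # the 600-token setting mentioned in the thread
)
index.save_to_disk("site_index_600.json")
```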
The community members caution that smaller chunk sizes can be more expensive: each embedding has a fixed-size output, so smaller chunks mean more embeddings per quantity of input data and therefore more vectors to store.
Hey! What would you recommend as the best way to index a website? I am using bs4 to crawl and format documents into a SimpleVectorStore and ChatGPT for llm_predictor, but the results are suboptimal for bigger websites -> often the answer is not found in the context that ChatGPT receives.
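For reference, a rough sketch of the kind of setup described above, not a verified recipe; the URLs, file names, and keyword arguments are assumptions based on the llama_index and langchain APIs of this era:

```python
# Rough sketch: fetch pages, strip them to text with BeautifulSoup, wrap each
# page in a Document, and build a simple vector index with ChatGPT as the LLM
# predictor. URLs and file names below are hypothetical.
import requests
from bs4 import BeautifulSoup
from langchain.chat_models import ChatOpenAI
from llama_index import Document, GPTSimpleVectorIndex, LLMPredictor

urls = [
    "https://example.com/docs/getting-started",
    "https://example.com/docs/configuration",
]

documents = []
for url in urls:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # get_text() flattens the page to plain text; extra_info keeps the source URL
    documents.append(Document(soup.get_text(separator="\n"), extra_info={"url": url}))

# ChatGPT (gpt-3.5-turbo) as the llm_predictor, as described in the question
llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))

index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)
index.save_to_disk("site_index.json")
```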
I am experimenting with something similar now. One thing I didn't expect: @dagthomas shared a post yesterday with the chunk size limit set to 600, and while I need to test a lot more, the smaller chunk size has been the most impactful parameter so far. I haven't tried the knowledge graph yet, although I am about to. The playground code looks especially useful for testing different combinations: https://github.com/jerryjliu/gpt_index/blob/main/examples/playground/PlaygroundDemo.ipynb
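For anyone who hasn't opened the linked notebook, a hedged sketch of what the playground comparison looks like; the Playground import path and the from_docs/compare helpers are assumptions based on that demo, and the folder and query are made up:

```python
# Hedged sketch of the playground comparison from the linked notebook; the
# import path and the from_docs / compare helpers are assumed from that demo.
from llama_index import SimpleDirectoryReader
from llama_index.playground import Playground

documents = SimpleDirectoryReader("crawled_pages").load_data()

# from_docs builds a default set of index types over the same documents
playground = Playground.from_docs(documents)

# compare runs the same query against every index so the answers (and costs)
# can be compared side by side
results = playground.compare("What does the pricing page say about the free tier?")
print(results)
```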
Keep in mind, a smaller chunk size can be more expensive: embeddings have a fixed-size output, so if you put 600 tokens into each embedding instead of 4k, you need more embeddings per quantity of input data and accordingly more vectors to store.
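Some back-of-the-envelope arithmetic on that point, using an assumed corpus size and the 1536-dimensional output of text-embedding-ada-002:

```python
# Back-of-the-envelope cost of smaller chunks: each embedding is a fixed-size
# vector (1536 floats for text-embedding-ada-002), so smaller chunks mean more
# vectors for the same corpus. The corpus size here is a made-up example.
corpus_tokens = 1_000_000
embedding_dim = 1536
bytes_per_float = 4

for chunk_tokens in (4000, 600):
    n_vectors = -(-corpus_tokens // chunk_tokens)  # ceiling division
    storage_mb = n_vectors * embedding_dim * bytes_per_float / 1e6
    print(f"{chunk_tokens}-token chunks -> {n_vectors} vectors, ~{storage_mb:.1f} MB")
```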
@adrianlee2220 I have been eager to try out the playground myself but haven't had a chance yet, and it may be a few days before I can. If you get the chance to share your experience or any pointers, I would be grateful!