Find answers from the community

Updated 2 years ago

Hey! What would you recommend as the best way to index a website?

At a glance

The community member is seeking recommendations for the best way to index a website. They are using BeautifulSoup (bs4) to crawl pages and load the documents into a SimpleVectorStore, with ChatGPT as the llm_predictor, but the results are suboptimal for larger websites: the answer is often not found in the context that ChatGPT receives.

The comments suggest that the community member should use the latest versions of llama_index and langchain, as there are some ChatGPT-specific improvements. They also recommend increasing the similarity_top_k parameter in the query call to load more context found from embeddings.

The community members discuss experimenting with smaller chunk sizes, as this has been the most impactful parameter so far. They also mention trying out the knowledge graph and the playground code, which is said to be useful for testing different combinations.

The community members caution that smaller chunk sizes can be more expensive, as embeddings have a fixed size output, and more embeddings are needed per quantity of input data, resulting in more vectors to store.

Hey! What would you recommend as the best way to index a website?
I am using bs4 to crawl and format documents into a SimpleVectorStore and ChatGPT for the llm_predictor, but the results are suboptimal for bigger websites.
-> Often the answer is not found in the context that ChatGPT receives.
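For reference, a minimal sketch of that kind of pipeline, assuming the llama_index ~0.4.x / gpt_index API in use at the time (the URLs and helper name are placeholders; exact constructor arguments may differ between versions):

```python
# Sketch only: crawl pages with bs4, wrap them as Documents, and build a simple vector
# index with ChatGPT (gpt-3.5-turbo) as the LLM predictor. Names follow llama_index ~0.4.x.
import requests
from bs4 import BeautifulSoup
from langchain.chat_models import ChatOpenAI
from llama_index import Document, GPTSimpleVectorIndex, LLMPredictor

def page_to_document(url: str) -> Document:
    # Fetch a page and reduce it to visible text
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator="\n", strip=True)
    return Document(text, extra_info={"url": url})

urls = ["https://example.com/docs/a", "https://example.com/docs/b"]  # placeholder URLs
documents = [page_to_document(u) for u in urls]

# ChatGPT as the llm_predictor for response synthesis
llm_predictor = LLMPredictor(llm=ChatOpenAI(temperature=0, model_name="gpt-3.5-turbo"))
index = GPTSimpleVectorIndex(documents, llm_predictor=llm_predictor)
```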
8 comments
Are you using the latest llama_index and langchain versions? There are some ChatGPT specific improvements in both. There's also a small demo here: https://github.com/jerryjliu/gpt_index/blob/main/examples/vector_indices/SimpleIndexDemo-ChatGPT.ipynb

If you are still encountering issues, you might need to increase similarity_top_k in your query call
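For illustration, a minimal sketch of that suggestion against the same ~0.4.x-era API (the question string is a placeholder; by default only the single most similar chunk is retrieved):

```python
# Sketch only: retrieve more than one chunk of context per query.
# similarity_top_k defaults to 1 on the simple vector index in this API era.
response = index.query(
    "How do I configure X on the site?",  # placeholder question
    similarity_top_k=3,                   # load the 3 most similar chunks instead of 1
)
print(response)
```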
so increasing similarity_top_k loads more than one chunk of context found via embeddings?
yeah I updated to 0.4.22 today, will check langchain also
I am experimenting with something similar now. One thing I didn't expect: @dagthomas shared a post yesterday with the chunk size limit set to 600, and while I need to test a lot more, the smaller chunk size has been the most impactful parameter so far. I haven't tried the knowledge graph yet, although I am about to try that one. The playground code looks especially useful for testing different combinations: https://github.com/jerryjliu/gpt_index/blob/main/examples/playground/PlaygroundDemo.ipynb
Keep in mind, smaller chunk sizes can be more expensive, as embeddings have a fixed-size output. So if you put 600 tokens into each embedding instead of 4k, you need more embeddings per quantity of input data and accordingly more vectors to store.
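As a rough sketch of that trade-off, again assuming the ~0.4.x-era API (where chunk_size_limit could be passed when building the index; the value 600 comes from the thread, everything else is illustrative):

```python
# Sketch only: build the index with smaller chunks. Smaller chunks tend to give tighter,
# more relevant retrieval, but mean more embeddings (and more stored vectors) per site.
index = GPTSimpleVectorIndex(
    documents,
    llm_predictor=llm_predictor,
    chunk_size_limit=600,  # max tokens per chunk, per the setting mentioned above
)
response = index.query("How do I configure X on the site?", similarity_top_k=3)
```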
Hey, that playground code looks very useful for trying to figure this out, thanks mate
Exactly! By default, it's set to one
@adrianlee2220 I have been eager to try out the playground myself but haven't had a chance yet, and it may be a few days before I can. If you get the chance to share your experience or any pointers, I would be grateful!