
Updated 2 years ago

Hey What would you recommend as best way

Hey! What would you recommend as the best way to index a website?
I am using bs4 to crawl pages and format them into documents for a SimpleVectorStore, with ChatGPT as the llm_predictor, but the results are suboptimal for bigger websites.
-> Often the answer is not in the context that ChatGPT receives.
8 comments
Are you using the latest llama_index and langchain versions? There are some ChatGPT specific improvements in both. There's also a small demo here: https://github.com/jerryjliu/gpt_index/blob/main/examples/vector_indices/SimpleIndexDemo-ChatGPT.ipynb

If you are still encountering issues, you might need to increase similarity_top_k in your query call.
So increasing similarity_top_k loads more than one context chunk found via the embeddings?
Yeah, I updated to 0.4.22 today, will check langchain too.
I am experimenting with something similar now. One thing I didn't expect: @dagthomas shared a post yesterday with the chunk size limit set to 600, and while I need to test a lot more, the smaller chunk size has been the most impactful parameter so far. I haven't tried the knowledge graph yet, although I am about to. The playground code looks especially useful for testing different combinations: https://github.com/jerryjliu/gpt_index/blob/main/examples/playground/PlaygroundDemo.ipynb
Keep in mind that a smaller chunk size can be more expensive, because embeddings have a fixed-size output. If you put 600 tokens into each embedding instead of 4k, you need more embeddings for the same quantity of input data, and accordingly more vectors to store.
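To put rough numbers on that trade-off, here is a minimal sketch. The corpus size of 1,000,000 tokens and the helper name `embeddings_needed` are made up for illustration, and chunk overlap is ignored, so treat the output as a ballpark rather than what any particular library would produce.

```python
import math


def embeddings_needed(total_tokens: int, chunk_size: int) -> int:
    """Number of chunks (one embedding vector each) needed to cover a corpus.

    Hypothetical helper: ignores chunk overlap and per-document boundaries.
    """
    return math.ceil(total_tokens / chunk_size)


corpus_tokens = 1_000_000  # assumed size of the crawled website, in tokens

for chunk_size in (4000, 600):
    count = embeddings_needed(corpus_tokens, chunk_size)
    print(f"chunk size {chunk_size}: {count} vectors to store")
```

Each vector has the same dimensionality regardless of chunk size, so under these assumptions the 600-token setting stores several times as many vectors as the 4k one for the same site.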
Hey, that playground code looks very useful for trying to figure this out, thanks mate!
Exactly! By default, similarity_top_k is set to one.
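For anyone wondering what a top-k setting does conceptually, here is a rough standalone sketch. It is not llama_index's implementation: the function names, the three-dimensional toy vectors, and the chunk labels are all made up for illustration. The idea is just to rank stored chunk embeddings by cosine similarity to the query embedding and keep the best k.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, k=1):
    """Return the names of the k chunks most similar to the query.

    k=1 mirrors the default mentioned in the thread: only the single
    best-matching chunk is passed to the LLM as context.
    """
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Toy embeddings (hypothetical); real ones have hundreds of dimensions or more.
chunks = {
    "pricing": [0.9, 0.1, 0.0],
    "support": [0.7, 0.6, 0.1],
    "careers": [0.0, 0.2, 0.9],
}
query = [0.8, 0.4, 0.0]

print(top_k_chunks(query, chunks, k=1))
print(top_k_chunks(query, chunks, k=2))
```

With k=1, an answer that lives in the second-best chunk never reaches the model, which is one plausible reason answers go missing on bigger sites. Raising k widens the context at the cost of more prompt tokens.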
@adrianlee2220 I have been eager to try out the playground myself but haven't had a chance yet, and it may be a few days before I can. If you get the chance to share your experience or any pointers, I would be grateful!