Updated 3 months ago

Extracting Information from Websites to Answer Queries

At a glance

The post asks about the best way to extract information from a website and respond to queries about the website and its links. The comments provide some suggestions:

A community member recommends using a library like BeautifulSoup to scrape websites, filter the data appropriately, and create Document objects to build a VectorStore. They note that this is a baseline approach and there are many ways to improve performance.

Another community member (the original poster) has tried this approach but is not getting good performance. They are building a chatbot that answers questions about a company from the company's website, and they ask for a better way to extract the information and respond.

A third community member suggests trying different approaches such as SubQuestionQueryEngine, reranking, and fine-tuning to improve performance. They emphasize the importance of the data ingestion step and of asking whether a human would be able to understand the information being processed. They also mention that a good system prompt can sometimes help.

There is no explicitly marked answer in the comments.

What is the best way to extract information from a website and answer queries about the website and the links in it?
3 comments
If you want to use RAG:
  • Use some library like BeautifulSoup to scrape your websites.
  • Filter your data appropriately.
  • Convert it to Markdown or use the HTMLNodeParser directly.
  • Create your Document objects and build your VectorStore on top of it.
  • Enjoy!
(You can do a lot of extras to improve the performance, but this should give you a baseline to work on)
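The baseline steps above can be sketched roughly as follows. This is a minimal illustration, not the commenter's code: to keep it runnable without extra installs, it swaps in Python's standard-library `html.parser` for BeautifulSoup, and it represents each document as a plain dict. The names `SKIP_TAGS`, `PageTextExtractor`, `html_to_document`, and the sample HTML are all hypothetical. In a LlamaIndex setup, the resulting text and metadata would typically go into `Document(text=..., metadata=...)` objects before building the index.

```python
from html.parser import HTMLParser

# Assumption: these tags mostly hold noise (menus, scripts, legal text)
# that hurts retrieval quality. Adjust the set for your own site.
SKIP_TAGS = {"script", "style", "nav", "footer", "header"}


class PageTextExtractor(HTMLParser):
    """Collects visible text and link targets, ignoring SKIP_TAGS content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self.chunks = []      # cleaned text fragments in page order
        self.links = []       # href values found in kept content

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "a" and self._skip_depth == 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_document(html, url):
    """Turn one page into a document-like dict (text plus metadata)."""
    parser = PageTextExtractor()
    parser.feed(html)
    parser.close()
    return {
        "text": " ".join(parser.chunks),
        "metadata": {"url": url, "links": parser.links},
    }


# Hypothetical page used only for illustration.
SAMPLE_HTML = """
<html><head><title>Acme</title><style>body{color:red}</style></head>
<body>
<nav><a href="/home">Home</a></nav>
<h1>About Acme</h1>
<p>Acme builds <a href="/products">rockets</a>.</p>
<script>track();</script>
<footer>Copyright</footer>
</body></html>
"""

doc = html_to_document(SAMPLE_HTML, "https://example.com/about")
print(doc["text"])
print(doc["metadata"]["links"])
```

Keeping the source URL and outgoing links in the metadata is one way to let the chatbot cite pages or follow links later; whether that helps depends on the site.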
I have tried this approach, but I am not getting good performance. So I am asking for a better way to get information from a website and respond to queries, as I am building a chatbot for a company's website that answers questions about the company using the site's content.
Just try different things: SubQuestionQueryEngine, reranking, and fine-tuning are all options to improve performance. The ingestion step is always important, so ask yourself how good the data you are scanning really is. You always need to think about whether you, as a human, would be able to understand what is going on. A good system prompt can also sometimes do wonders.
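To make the reranking idea concrete, here is a toy two-stage sketch using only the standard library. It is an assumption-laden illustration, not a LlamaIndex API: the first stage counts shared words to gather candidates, and the second stage reorders them with Jaccard similarity as a stand-in for a stronger scorer (in practice, often a cross-encoder model). The function names and sample chunks are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def first_stage(query, chunks, k=3):
    """Cheap, recall-oriented retrieval: rank by number of shared tokens."""
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(c))), c) for c in chunks]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:k]]


def rerank(query, candidates):
    """Reorder candidates by Jaccard similarity, which penalizes long,
    unfocused chunks. Stand-in for a real reranking model."""
    q = set(tokenize(query))

    def jaccard(chunk):
        c = set(tokenize(chunk))
        return len(q & c) / len(q | c)

    return sorted(candidates, key=jaccard, reverse=True)


# Hypothetical chunks scraped from a company site.
chunks = [
    "Acme was founded in 1999 and our opening hours, pricing, careers, "
    "contact and support pages are linked in the footer",
    "Acme support opening hours are 9 to 5 on weekdays",
    "Our rockets ship worldwide",
]
query = "What are the opening hours of Acme support?"

candidates = first_stage(query, chunks)
reranked = rerank(query, candidates)
print(reranked[0])
```

Here the long, link-heavy chunk wins the first stage on raw overlap, while the second stage moves the short chunk that actually answers the question to the top. That is the effect a reranker is meant to have, and it also shows why noisy ingested chunks hurt.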