Any best practices for scraping web pages? I assume we'd want to convert them to plain text, stripping out the HTML. Even so, a big portion of the resulting text ends up being navbar, header, and footer content.
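For example, this is roughly what I mean by stripping it manually (a rough sketch; the tag list is a guess and usually needs tweaking per site):

```python
import requests
from bs4 import BeautifulSoup

def page_to_text(url: str) -> str:
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # Remove the obvious non-content "chrome" before extracting text.
    # Which tags/classes to drop varies site by site.
    for tag in soup.find_all(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()

    # Collapse the remaining markup to plain text.
    return soup.get_text(separator="\n", strip=True)
```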
Is there a more automated method than writing a custom parser (e.g., with BeautifulSoup) for each type of page?
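I've seen libraries like trafilatura mentioned for this kind of automatic main-content extraction; a minimal sketch of what I'm imagining, assuming its `fetch_url`/`extract` API (the URL is just a placeholder):

```python
import trafilatura  # pip install trafilatura

# fetch_url downloads the page; extract tries to keep only the main
# article content, dropping navbars, footers, and other boilerplate.
downloaded = trafilatura.fetch_url("https://example.com/some-article")
text = trafilatura.extract(downloaded)  # returns the main text, or None
if text:
    print(text)
```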
Currently I'm using BeautifulSoupWebReader for individual pages and SitemapReader (with html_to_text=True) to capture whole domains, but I'm open to any loader that does the job better. Thank you!
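For context, roughly my current setup (import paths are an assumption and may differ across llama_index versions; the URLs are placeholders):

```python
from llama_index.readers.web import BeautifulSoupWebReader, SitemapReader

# Individual pages
docs = BeautifulSoupWebReader().load_data(urls=["https://example.com/page"])

# Whole domain via its sitemap, converting HTML to text up front
sitemap_docs = SitemapReader(html_to_text=True).load_data(
    sitemap_url="https://example.com/sitemap.xml"
)
```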