The community member is seeking best practices for scraping web pages, specifically how to extract the core content while avoiding the noise from the navbar, header, and footer. They have tried using BeautifulSoupWebReader and SitemapReader, but are open to other methods. Another community member mentions using a scraping service to avoid 403 errors, and the original poster asks for more details on this approach and whether it also solves the "extract the core content" issue.
any best practices for scraping web pages? i assume we'd want to convert them to plain text, stripping out the html. even so, a big portion of the text ends up coming from the navbar, header, footer, etc.
is there a more automated method than writing a custom parser (i.e. using beautifulsoup) for each type of page?
currently i'm using BeautifulSoupWebReader for individual pages, and SitemapReader (with html_to_text=True) to crawl whole domains. but i'm open to any loader that does the job better. thank you!
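For reference, here is a minimal sketch of the kind of custom extractor this usually boils down to, independent of any particular loader: fetch the page, decompose the obvious "chrome" elements (nav, header, footer, aside, scripts, styles), then pull text from a `<main>` or `<article>` container if the page has one. The URL and the tag list are illustrative assumptions you would tune per site, not a definitive recipe.

```python
# minimal sketch: fetch a page, drop common chrome elements, keep the core text
# the tag list below is a heuristic assumption, not an exhaustive rule
import requests
from bs4 import BeautifulSoup

def extract_core_text(url: str) -> str:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # remove elements that are almost never part of the main content
    for tag in soup(["nav", "header", "footer", "aside", "script", "style", "noscript", "form"]):
        tag.decompose()

    # prefer an explicit <main>/<article> container when the page provides one
    main = soup.find("main") or soup.find("article") or soup.body or soup
    return main.get_text(separator="\n", strip=True)

# example usage with a placeholder URL
text = extract_core_text("https://example.com/some-page")
print(text[:500])
```

If you'd rather not maintain the heuristics yourself, libraries such as trafilatura or readability-lxml implement roughly the same boilerplate-removal idea and can be dropped in front of whatever loader you use; the cleaned text can then be wrapped in a Document and indexed as usual (the exact import path depends on your LlamaIndex version).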