The community member is seeking best practices for scraping web pages, specifically how to extract the core content while avoiding the noise from the navbar, header, and footer. They have tried using BeautifulSoupWebReader and SitemapReader, but are open to other methods. Another community member mentions using a scraping service to avoid 403 errors, and the original poster asks for more details on this approach and whether it also solves the "extract the core content" issue.
any best practices for scraping web pages? i assume we'd want to convert them to plain text, stripping out the html. even so, a big portion of the text ends up coming from the navbar, header, footer, etc.
is there a more automated method than writing a custom parser (i.e. using beautifulsoup) for each type of page?
currently i'm using BeautifulSoupWebReader for individual pages, and SitemapReader (with html_to_text=True) to crawl whole domains. but i'm open to any loader that does the job better. thank you!
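For reference, here is a minimal sketch of the kind of custom extractor this usually boils down to, independent of any particular loader: fetch the page, decompose the obvious "chrome" elements (nav, header, footer, aside, scripts, styles), then pull text from a `<main>` or `<article>` container if the page has one. The URL and the tag list are illustrative assumptions you would tune per site, not a definitive recipe.

```python
# minimal sketch: fetch a page, drop common chrome elements, keep the core text
# the tag list below is a heuristic assumption, not an exhaustive rule
import requests
from bs4 import BeautifulSoup

def extract_core_text(url: str) -> str:
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")

    # remove elements that are almost never part of the main content
    for tag in soup(["nav", "header", "footer", "aside", "script", "style", "noscript", "form"]):
        tag.decompose()

    # prefer an explicit <main>/<article> container when the page provides one
    main = soup.find("main") or soup.find("article") or soup.body or soup
    return main.get_text(separator="\n", strip=True)

# example usage with a placeholder URL
text = extract_core_text("https://example.com/some-page")
print(text[:500])
```

If you'd rather not maintain the heuristics yourself, libraries such as trafilatura or readability-lxml implement roughly the same boilerplate-removal idea and can be dropped in front of whatever loader you use; the cleaned text can then be wrapped in a Document and indexed as usual (the exact import path depends on your LlamaIndex version).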