Any best practices for scraping web pages?

Any best practices for scraping web pages? I assume we'd want to convert each page to plain text, stripping out the HTML. Even so, a big portion of the text ends up being the navbar, header, footer, etc.

Is there a more automated method than writing a custom parser (i.e., using BeautifulSoup) for each type of page?
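For reference, a minimal sketch (with a placeholder URL) of two generic approaches to the boilerplate problem: strip the obvious layout tags with BeautifulSoup, or let a readability-style extractor such as trafilatura pull out the main content with its own heuristics.

```python
# A minimal sketch, not a definitive recipe; the URL is a placeholder.
import requests
from bs4 import BeautifulSoup
import trafilatura

url = "https://example.com/article"  # hypothetical page
html = requests.get(url, timeout=30).text

# Option 1: remove the obvious boilerplate tags, then take the remaining text.
soup = BeautifulSoup(html, "html.parser")
for tag in soup(["nav", "header", "footer", "aside", "script", "style"]):
    tag.decompose()  # drop the element and everything inside it
plain_text = soup.get_text(separator="\n", strip=True)

# Option 2: let a readability-style extractor find the main content automatically.
main_text = trafilatura.extract(html)  # returns the main article text, or None
```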

4 comments
Currently I'm using BeautifulSoupWebReader for individual pages, and SitemapReader (with html_to_text=True) to capture whole domains, but I'm open to any loader that does the job better. Thank you!
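A rough sketch of wiring those two loaders up, assuming a recent llama-index with the llama-index-readers-web package installed; the exact import path may differ by version, and the URLs are placeholders.

```python
# Sketch only: import path and URLs are assumptions, adjust to your llama-index version.
from llama_index.readers.web import BeautifulSoupWebReader, SitemapReader

# Individual pages
page_docs = BeautifulSoupWebReader().load_data(
    urls=["https://example.com/docs/intro"]  # placeholder URL
)

# A whole domain via its sitemap, converting HTML to plain text on load
site_docs = SitemapReader(html_to_text=True).load_data(
    sitemap_url="https://example.com/sitemap.xml"  # placeholder URL
)
```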
I'm using a scraping service because of the constant 403 errors.
Thanks! Can you share which one, and does it solve the "extract the core content" issue, or just the 403 errors?
I'm scraping <p> tags with scrapingfish.com; I'm not sure what issue you're referring to.
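As a generic aside (not scrapingfish's API): many 403s come from sending a default client User-Agent, so browser-like request headers are a cheap first thing to try before a paid service. A minimal sketch with a placeholder URL:

```python
# Generic 403 mitigation sketch; the URL is a placeholder.
import requests

headers = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/124.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}
resp = requests.get("https://example.com/article", headers=headers, timeout=30)
resp.raise_for_status()  # raises on 403 and other HTTP errors
html = resp.text
```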