What is Retrieval Augmented Generation

My use case: I need to extract only the body text from some URLs, e.g. https://www.datastax.com/guides/what-is-retrieval-augmented-generation?filter=%7B%7D
This page also has text in the footer and in other non-content areas. How do I selectively get only the relevant body text?

I looked at the following LlamaHub loaders:

https://llamahub.ai/l/web-beautiful_soup_web
https://llamahub.ai/l/web-async_web
https://llamahub.ai/l/web-unstructured_web

Which do you think works best for my use case?

2 comments

Teemu

BeautifulSoup is probably your best option. I'm not sure you'll be able to automatically parse out just the main content, since the footer text is wrapped in a similar element. If you parse only the 'p' elements, the result will be relatively clean, but it might still include some descriptive text from the footer.
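A rough sketch of that approach, in case it helps. It assumes the boilerplate sits in semantic containers such as `<footer>` and `<nav>`, which may not hold for the page in question (if the footer is a plain `<div>`, you would need a class or id selector instead). `SAMPLE_HTML` and `extract_body_paragraphs` are made-up names for illustration; a real page would be downloaded first.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page standing in for a downloaded document.
SAMPLE_HTML = """
<html><body>
  <nav><p>Products | Pricing</p></nav>
  <main>
    <h1>What is Retrieval Augmented Generation</h1>
    <p>RAG combines retrieval with text generation.</p>
    <p>It grounds model output in external documents.</p>
  </main>
  <footer><p>Copyright and legal links</p></footer>
</body></html>
"""


def extract_body_paragraphs(html: str) -> list[str]:
    """Return the text of <p> elements outside common boilerplate containers."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: boilerplate lives in these tags. Remove them from the tree.
    for tag in soup.find_all(["nav", "header", "footer", "aside", "script", "style"]):
        tag.decompose()

    # Prefer <main> or <article> when present; otherwise search the whole document.
    root = soup.find("main") or soup.find("article") or soup

    # Keep only non-empty paragraphs.
    return [
        p.get_text(" ", strip=True)
        for p in root.find_all("p")
        if p.get_text(strip=True)
    ]


paragraphs = extract_body_paragraphs(SAMPLE_HTML)
print("\n".join(paragraphs))
```

On the sample above this keeps the two paragraphs inside `<main>` and drops the nav and footer text.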

LeMoussel

You can try Trafilatura.
Trafilatura is a Python package designed to gather text from the web. It includes discovery, extraction, and text-processing components.
