
Updated 3 months ago

What is Retrieval Augmented Generation

My use case is that I need to extract only the body text from URLs, e.g. https://www.datastax.com/guides/what-is-retrieval-augmented-generation?filter=%7B%7D
This page also has text in the footer and other unrelated content. How do I selectively get only the relevant body text?

I looked at the following llama-hub loaders.

Which do you think works the best for my use case?
2 comments
BeautifulSoup is probably your best option. I'm not sure whether you'll be able to automatically parse out just the main content, since the footer text is also wrapped in a similar element. If you parse only the 'p' elements, the result will be relatively clean but might include some descriptive text from the footer.
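To make that concrete, here is a minimal sketch of the 'p'-element approach with BeautifulSoup. It assumes the page marks its chrome with semantic tags such as `<footer>` and `<nav>`, which may not hold for the DataStax page; if the footer is just another `<div>`, you would need a site-specific selector instead. The function name `extract_body_text` and the tag list are illustrative, not part of any loader.

```python
from bs4 import BeautifulSoup


def extract_body_text(html: str) -> str:
    """Return the text of <p> elements, skipping common page chrome."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: navigation and footer text live in these tags.
    # Adjust the list for the site you are scraping.
    for tag in soup.find_all(["header", "footer", "nav", "aside", "script", "style"]):
        tag.decompose()

    # Prefer <main> or <article> when the page has one, else search the whole document.
    root = soup.find("main") or soup.find("article") or soup

    paragraphs = [p.get_text(" ", strip=True) for p in root.find_all("p")]
    return "\n\n".join(text for text in paragraphs if text)
```

You would pass in the HTML you fetched for the URL (for example with `requests.get(url).text`) and inspect the result by hand, since anything the footer keeps in plain 'p' elements outside a `<footer>` tag will still come through.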
You can also try Trafilatura.
Trafilatura is a Python package designed to gather text on the Web. It includes discovery, extraction and text processing components.