
Updated 4 months ago

URL reader

At a glance

The community member is trying to extract text from URL links, but the BeautifulSoupWebReader is only returning the header and footer text. The community member asks for help on how to go about this. Another community member suggests trying different loaders from Llamahub, a website that provides web plugins. The second community member then provides sample code for testing various web plugins, including BeautifulSoupWebReader, SimpleWebPageReader, UnstructuredURLLoader, and ReadabilityWebPageReader. However, there is no explicitly marked answer in the comments.

My use case is that I need to extract the text from some URLs, e.g. https://www.datastax.com/guides/what-is-retrieval-augmented-generation?filter=%7B%7D

When I asked this question earlier, you suggested using the BeautifulSoup reader. However, doing this:
loader = BeautifulSoupWebReader()
documents = loader.load_data(urls=['https://research.ibm.com/blog/retrieval-augmented-generation-RAG'])

only returns the header and footer text.
How should I go about this? Please help.
2 comments
You could try a different loader from LlamaHub if BeautifulSoup is not working.

https://llamahub.ai/
I test web plugins like this:
from llama_index import download_loader

# Page used to compare the loaders.
TEST_URL = "https://research.ibm.com/blog/retrieval-augmented-generation-RAG"


def load_with_beautiful_soup(url):
    # https://llamahub.ai/l/web-beautiful_soup_web
    reader_cls = download_loader("BeautifulSoupWebReader")
    return reader_cls().load_data(urls=[url])


def load_with_simple_web_page(url):
    # https://llamahub.ai/l/web-simple_web
    reader_cls = download_loader("SimpleWebPageReader")
    return reader_cls().load_data(urls=[url])


def load_with_unstructured(url):
    # https://llamahub.ai/l/web-unstructured_web
    loader_cls = download_loader("UnstructuredURLLoader")
    loader = loader_cls(
        urls=[url],
        continue_on_failure=False,
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Firefox/102.0"},
    )
    return loader.load()


def load_with_readability(url):
    # https://llamahub.ai/l/web-readability_web
    reader_cls = download_loader("ReadabilityWebPageReader")
    return reader_cls().load_data(url=url)


# Map each plugin name to the function that runs it.
llamahub_web_plugins = {
    "BeautifulSoupWebReader": load_with_beautiful_soup,
    "SimpleWebPageReader": load_with_simple_web_page,
    "UnstructuredURLLoader": load_with_unstructured,
    # Uses Playwright; left disabled here.
    # "ReadabilityWebPageReader": load_with_readability,
}

# Run every enabled plugin on the same page and compare how much text comes back.
for plugin_name, load in llamahub_web_plugins.items():
    documents = load(TEST_URL)
    print(f"Llama Hub Web plugin: {plugin_name} \t Text length: {len(documents[0].text)}")
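If one of the loaders raises on a given page, the comparison loop stops before the remaining plugins are tried. Here is a minimal sketch of a more tolerant variant. It assumes each loader is a function that takes a URL and returns a list of objects with a `.text` attribute, as in the print statement above. `FakeDocument`, `compare_loaders`, `best_loader`, and the demo loaders are hypothetical names made up for illustration; they are not part of llama_index or LlamaHub.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a loader's document; only `.text` is used."""
    text: str


def compare_loaders(
    loaders: Dict[str, Callable[[str], List[FakeDocument]]], url: str
) -> Dict[str, Optional[int]]:
    """Run every loader on `url`; map each name to its total text length, or None on failure."""
    lengths: Dict[str, Optional[int]] = {}
    for name, load in loaders.items():
        try:
            documents = load(url)
        except Exception as exc:  # keep going so one broken plugin does not hide the rest
            print(f"{name}: failed ({exc})")
            lengths[name] = None
            continue
        # Sum over all returned documents instead of reading only documents[0].
        lengths[name] = sum(len(doc.text) for doc in documents)
    return lengths


def best_loader(lengths: Dict[str, Optional[int]]) -> Optional[str]:
    """Return the name of the loader that produced the most text, or None if all failed."""
    succeeded = {name: n for name, n in lengths.items() if n is not None}
    return max(succeeded, key=succeeded.get) if succeeded else None


def _failing_loader(url: str) -> List[FakeDocument]:
    # Assumed failure mode for the demo, e.g. a blocked request.
    raise RuntimeError("request blocked")


# Demo with fake loaders, so the sketch runs without network access.
demo_loaders = {
    "header_footer_only": lambda url: [FakeDocument("header footer")],
    "full_article": lambda url: [FakeDocument("article body " * 100)],
    "broken": _failing_loader,
}
demo_lengths = compare_loaders(demo_loaders, "https://example.com")
print(demo_lengths, "->", best_loader(demo_lengths))
```

To try it with the real plugins, pass the plugin dictionary from the code above in place of `demo_loaders`. Text length is only a rough signal: a longer result can still be mostly navigation text, so it is worth reading the output of the winning loader before relying on it.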