Automated retrieval with llama-index

At a glance

A community member reports that a website blocks their automated retrieval with llama-index because the request is sent with the default "python-requests" user agent, and asks for a workaround. A reply suggests using the UnstructuredURLLoader with a custom user agent header instead.

kkev

Automated retrieval with llama-index blocked. Is there a work-around?

from llama_index.core import SummaryIndex
from llama_index.readers.web import SimpleWebPageReader
from IPython.display import Markdown, display
import os

documents = SimpleWebPageReader(html_to_text=True).load_data(["https://www.xyz.com"])
documents

[MyHomepage] Main\nContent Main Navigation\n\n## Page not available\n\nYour access to website has been blocked because you are using an\nautomated process to retrieve content\n\nReason: Automated retrieval by user agent "python-requests/2.31.0".\n\nURL:

1 comment

RRohan

You might want to try the UnstructuredURLLoader. It lets you change the user agent, which should help since the default agent is the one being blocked.

from llama_index.readers.web import UnstructuredURLLoader

urls_to_fetch = ["https://www.xyz.com"]
loader = UnstructuredURLLoader(urls=urls_to_fetch, headers={"user-agent": "custom user agent"})
documents = loader.load_data()
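To make the user agent idea concrete without depending on any particular loader, here is a minimal sketch using only the Python standard library. It is not part of the original thread: the names `CUSTOM_USER_AGENT`, `build_request`, and `fetch_html` are hypothetical, and the user agent string is just a placeholder. Whether a given site accepts it is up to that site.

```python
import urllib.request

# Hypothetical placeholder user agent; substitute a string that suits your use case.
CUSTOM_USER_AGENT = "Mozilla/5.0 (compatible; my-retriever/0.1)"


def build_request(url: str, user_agent: str = CUSTOM_USER_AGENT) -> urllib.request.Request:
    """Create a GET request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url: str, user_agent: str = CUSTOM_USER_AGENT, timeout: float = 10.0) -> str:
    """Download a page and decode it, falling back to UTF-8 if no charset is declared."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html("https://www.xyz.com")` would send the request with the custom header instead of the library default. The returned text could then be wrapped in a llama-index `Document(text=...)` if you want to keep the rest of the indexing code unchanged.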
