
Updated 3 months ago

Automated retrieval with llama-index

Automated retrieval with llama-index is being blocked by the website. Is there a workaround?

from llama_index.core import SummaryIndex
from llama_index.readers.web import SimpleWebPageReader
from IPython.display import Markdown, display
import os

documents = SimpleWebPageReader(html_to_text=True).load_data(["https://www.xyz.com"])
documents


[MyHomepage] Main\nContent Main Navigation\n\n## Page not available\n\nYour access to website has been blocked because you are using an\nautomated process to retrieve content\n\nReason: Automated retrieval by user agent "python-requests/2.31.0".\n\nURL:
1 comment
You might want to try the UnstructuredURLLoader. It lets you set a custom user agent, which should help since the default agent (python-requests) is the one being blocked.
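# Assumes the same llama-index-readers-web package used in the question
from llama_index.readers.web import UnstructuredURLLoader

urls_to_fetch = ["https://www.xyz.com"]  # the URL from the question
loader = UnstructuredURLLoader(urls=urls_to_fetch, headers={"user-agent": "custom user agent"})
documents = loader.load_data()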
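
If installing another loader is not an option, the same idea (sending a user agent other than the blocked python-requests default) can be sketched with the standard library alone. This is only a rough sketch: the names `CUSTOM_USER_AGENT`, `build_request`, and `fetch_html` are made up for illustration, the user agent string is an arbitrary example, and whether the site accepts it is not guaranteed.

```python
import urllib.request

# Hypothetical user agent string (an assumption, not from the thread);
# substitute whatever value the target site accepts.
CUSTOM_USER_AGENT = "Mozilla/5.0 (compatible; my-crawler/1.0)"


def build_request(url: str, user_agent: str = CUSTOM_USER_AGENT) -> urllib.request.Request:
    """Create a GET request that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download the page and decode it, falling back to UTF-8 if no charset is declared."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html("https://www.xyz.com") should return the raw page source as a string. To use it with llama-index, the text can be wrapped in a Document from llama_index.core, for example Document(text=html). Unlike SimpleWebPageReader(html_to_text=True), this sketch does not convert the HTML to plain text.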