I need to pass in a header so that I can get the content back from a web crawl using BeautifulSoupWebReader. I am getting a 403, and I know this page can be scraped because LangChain's web loader let me pass in the header. My challenge is that I want to use LlamaIndex, and ideally the two document types would be identical, but sadly they are not. Is there a way to pass in the header? I couldn't find the source code to check (or if it is available, my bad, I couldn't find it). This gives a 403 for the page content:
Plain Text
from llama_index import GPTVectorStoreIndex, download_loader

BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")

loader = BeautifulSoupWebReader()
documents = loader.load_data(urls=['https://www.kirklandreporter.com/tag/football/'])
Help appreciated. Thank you.
You could still use the LangChain loader and then convert between the two document formats afterwards.
If you have a solution for modifying the LlamaHub loader to suit your needs, it would be a great contribution though!
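Something like this, as a rough sketch of the LangChain half (untested; the header_template argument to WebBaseLoader is from memory, so double-check it against your LangChain version):
Plain Text
from langchain.document_loaders import WebBaseLoader

# Sketch only: header_template is the constructor argument I remember WebBaseLoader
# using for custom request headers -- verify it exists in your LangChain version.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

loader = WebBaseLoader(
    'https://www.kirklandreporter.com/tag/football/',
    header_template=headers,
)
lc_documents = loader.load()  # returns LangChain Document objects
From there it is just a matter of converting those into llama_index documents.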
Is there an easy way to convert between the two? I have not found one. IMHO, it would be easiest if all data loaders agreed on a document format (and metadata). I'm still trying different things, so I will comment again. I very much appreciate your comment. Thank you.
What I am trying to do is see if I can make a podcast based on local news. One of the sources is our local online paper. I was hoping for a data loader that could slurp up all the content given the domain name of the online paper. However, the data loaders take URLs, so I wrote this code (keep in mind I'm self-taught):
Plain Text
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

# sections = ['news', 'business', 'sports', 'life', 'opinion', 'calendar', 'obituaries', 'classifieds']
sections = ['news', 'business', 'sports', 'life', 'opinion', 'obituaries', 'classifieds']


def get_article_urls(section):
    base_url = 'https://www.kirklandreporter.com/'
    section_url = urljoin(base_url, section)

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }

    response = requests.get(section_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    urls = set()
    for link in soup.find_all('a'):
        href = link.get('href')
        if href and (href.startswith('/') or href.startswith('http')):
            full_url = urljoin(base_url, href)
            urls.add(full_url)
    print(f"Found {len(urls)} in the {section} sections")
    return urls


all_urls = {url for section in sections for url in get_article_urls(section)}
print(f"Collected {len(all_urls)} URLs.")

with open('urls.txt', 'w') as f:
    f.writelines(f"{url}\n" for url in all_urls)
As you can see, I feed the URLs into a file. This is where the next step comes in. I want to put all the content into a LlamaIndex index and then form queries to get the text for the podcast. It is a home project; my goal is to help our community know more about what is going on.
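Roughly what I have in mind for that next step, as a sketch (untested; article_texts is just a placeholder for however the page content ends up getting fetched, and the query wording is only an example):
Plain Text
from llama_index import Document, GPTVectorStoreIndex

# Placeholder data: in practice this would be {url: page_text} built from urls.txt.
article_texts = {
    'https://www.kirklandreporter.com/tag/football/': 'article text fetched earlier',
}

# Wrap each article in a llama_index Document, keeping the URL as metadata.
documents = [Document(text=text, extra_info={'url': url})
             for url, text in article_texts.items()]

# Build a vector index over the articles and query it for podcast copy.
index = GPTVectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize this week's local sports news for a podcast segment.")
print(response)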
This should help to convert between the two document types:
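Here is a minimal sketch of that conversion, assuming the 0.x llama_index Document API (text/extra_info) and LangChain's page_content/metadata fields; adjust the field names if your versions differ:
Plain Text
from typing import List

from langchain.schema import Document as LCDocument
from llama_index import Document as LIDocument


def to_llama_index_docs(lc_docs: List[LCDocument]) -> List[LIDocument]:
    """Copy each LangChain document's text and metadata into a llama_index Document."""
    return [LIDocument(text=doc.page_content, extra_info=doc.metadata)
            for doc in lc_docs]


# Usage with the documents loaded by WebBaseLoader earlier:
# documents = to_llama_index_docs(lc_documents)
# index = GPTVectorStoreIndex.from_documents(documents)
If I recall correctly, newer llama_index versions also ship built-in converters on the Document class, but the manual copy above should work regardless of version.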
Thank you, the conversion worked well. Your help is very much appreciated.