I need to pass in a header so that I can get the content back from a web crawl using BeautifulSoupWebReader. I am getting a 403, and I know this page can be scraped because LangChain's web loader let me pass in the header. My challenge is that I want to use LlamaIndex, and ideally the two document types would be identical, but sadly they are not. Is there a way to pass in the header? I couldn't find the source code to check (or if it is available, my bad, I couldn't find it). This gives a 403 for the page content:
Plain Text
from llama_index import GPTVectorStoreIndex, download_loader

BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")

loader = BeautifulSoupWebReader()
documents = loader.load_data(urls=['https://www.kirklandreporter.com/tag/football/'])
Help appreciated. Thank you.
You could still use the LangChain loader and then convert between the two document formats afterwards.
If you have a solution for modifying the LlamaHub loader to suit your needs, it would be a great contribution though!
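Something like this, as a rough sketch of the LangChain half (untested; the header_template argument to WebBaseLoader is from memory, so double-check it against your LangChain version):
Plain Text
from langchain.document_loaders import WebBaseLoader

# Sketch only: header_template is the constructor argument I remember WebBaseLoader
# using for custom request headers -- verify it exists in your LangChain version.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}

loader = WebBaseLoader(
    'https://www.kirklandreporter.com/tag/football/',
    header_template=headers,
)
lc_documents = loader.load()  # returns LangChain Document objects
From there it is just a matter of converting those into llama_index documents.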
Is there an easy way to convert between the two? I have not found one. IMHO, it would be easiest if all data loaders agreed on a document format (and metadata). I'm still trying different things, so I will comment again. I very much appreciate your comment. Thank you.
What I am trying to do is see if I can make a podcast based on local news. One of the sources is our local online paper. I was hoping for a data loader that could slurp up all the content given the domain name of the online paper. However, the data loaders take URLs, so I wrote this code (keep in mind I'm self-taught):
Plain Text
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

# sections = ['news', 'business', 'sports', 'life', 'opinion', 'calendar', 'obituaries', 'classifieds']
sections = ['news', 'business', 'sports', 'life', 'opinion', 'obituaries', 'classifieds']


def get_article_urls(section):
    base_url = 'https://www.kirklandreporter.com/'
    section_url = urljoin(base_url, section)

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }

    response = requests.get(section_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    urls = set()
    for link in soup.find_all('a'):
        href = link.get('href')
        if href and (href.startswith('/') or href.startswith('http')):
            full_url = urljoin(base_url, href)
            urls.add(full_url)
    print(f"Found {len(urls)} in the {section} sections")
    return urls


all_urls = {url for section in sections for url in get_article_urls(section)}
print(f"Collected {len(all_urls)} URLs.")

with open('urls.txt', 'w') as f:
    f.writelines(f"{url}\n" for url in all_urls)
As you can see, I feed the URLs into a file. This is where the next step comes in. I want to put all the content into a LlamaIndex index and then form queries to get the text for the podcast. It is a home project; my goal is to help our community know more about what is going on.
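Roughly what I have in mind for that next step, as a sketch (untested; article_texts is just a placeholder for however the page content ends up getting fetched, and the query wording is only an example):
Plain Text
from llama_index import Document, GPTVectorStoreIndex

# Placeholder data: in practice this would be {url: page_text} built from urls.txt.
article_texts = {
    'https://www.kirklandreporter.com/tag/football/': 'article text fetched earlier',
}

# Wrap each article in a llama_index Document, keeping the URL as metadata.
documents = [Document(text=text, extra_info={'url': url})
             for url, text in article_texts.items()]

# Build a vector index over the articles and query it for podcast copy.
index = GPTVectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("Summarize this week's local sports news for a podcast segment.")
print(response)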
This should help to convert between the two document types:
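Here is a minimal sketch of that conversion, assuming the 0.x llama_index Document API (text/extra_info) and LangChain's page_content/metadata fields; adjust the field names if your versions differ:
Plain Text
from typing import List

from langchain.schema import Document as LCDocument
from llama_index import Document as LIDocument


def to_llama_index_docs(lc_docs: List[LCDocument]) -> List[LIDocument]:
    """Copy each LangChain document's text and metadata into a llama_index Document."""
    return [LIDocument(text=doc.page_content, extra_info=doc.metadata)
            for doc in lc_docs]


# Usage with the documents loaded by WebBaseLoader earlier:
# documents = to_llama_index_docs(lc_documents)
# index = GPTVectorStoreIndex.from_documents(documents)
If I recall correctly, newer llama_index versions also ship built-in converters on the Document class, but the manual copy above should work regardless of version.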
Thank you, the conversion worked well. Your help is very much appreciated.