What I am trying to do is see if I can make a podcast based on local news. One of the sources is our local online paper. I was hoping for a data loader that could slurp up all the content given just the domain name of the online paper. However, the data loaders take URLs as input, so I wrote this code (keep in mind I'm self-taught):
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests

# Sections to crawl; 'calendar' is currently left out.
# sections = ['news', 'business', 'sports', 'life', 'opinion', 'calendar', 'obituaries', 'classifieds']
sections = ['news', 'business', 'sports', 'life', 'opinion', 'obituaries', 'classifieds']

def get_article_urls(section):
    """Fetch a section page and return the set of absolute URLs it links to."""
    base_url = 'https://www.kirklandreporter.com/'
    section_url = urljoin(base_url, section)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
    }
    response = requests.get(section_url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    urls = set()
    for link in soup.find_all('a'):
        href = link.get('href')
        # Keep relative and absolute links; resolve relative ones against the base URL.
        if href and (href.startswith('/') or href.startswith('http')):
            full_url = urljoin(base_url, href)
            urls.add(full_url)
    print(f"Found {len(urls)} URLs in the {section} section")
    return urls

# Deduplicate across all sections with a set comprehension.
all_urls = {url for section in sections for url in get_article_urls(section)}
print(f"Collected {len(all_urls)} URLs.")

with open('urls.txt', 'w') as f:
    f.writelines(f"{url}\n" for url in all_urls)
As you can see, I write the URLs to a file. This is where the next step comes in: I want to load all of the content into a LlamaIndex index and then run queries to get the text for the podcast. It is a home project; my goal is to help our community know more about what is going on.
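For that next step, here is a minimal sketch of how the saved URLs could be fed into LlamaIndex. It assumes a recent llama-index install plus the web reader package (llama-index-readers-web, which also needs html2text for the html_to_text option) and an LLM/embedding setup such as an OpenAI API key for the defaults; older versions used different import paths (e.g. from llama_index import SimpleWebPageReader). The query string is only a placeholder for whatever prompt you end up using for the podcast script.

from llama_index.core import VectorStoreIndex
from llama_index.readers.web import SimpleWebPageReader  # pip install llama-index-readers-web

# Read back the URLs collected by the scraper above.
with open('urls.txt') as f:
    urls = [line.strip() for line in f if line.strip()]

# Fetch each page and convert its HTML to plain-text documents.
documents = SimpleWebPageReader(html_to_text=True).load_data(urls)

# Build a vector index over the documents and query it.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Placeholder query; swap in the prompt you want for the podcast draft.
response = query_engine.query(
    "Summarize this week's most important local news stories in a conversational tone."
)
print(response)

One note on the scraper side: since it collects every link on a section page, filtering urls.txt down to actual article pages (dropping navigation, category, and classifieds listing links) before loading will keep the index smaller and the query answers more focused.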