Updated last year

Best way to scrape XML

Best way to scrape XML?

xml document: https://a2gov.legistar.com/Feed.ashx?M=CalendarDetail&ID=1062177&GUID=C34A240A-927A-4588-928D-77501A644084&Title=City+of+Ann+Arbor+-+Meeting+of+City+Council+on+7%2f17%2f2023+at+7%3a00+PM

I've tried these two, not sure I'm loving the results, maybe the text splitter is bad?

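# Imports assumed; paths vary by LangChain / LlamaIndex version. chunk_size is set elsewhere.
from langchain.document_loaders import UnstructuredXMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from llama_index import download_loader
RssReader = download_loader("RssReader")
# Attempt 1: LangChain's UnstructuredXMLLoader with a token-based splitter
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=chunk_size, chunk_overlap=300)
reader = UnstructuredXMLLoader('./data/agenda.xml')
documents_unstructured = reader.load_and_split(text_splitter=text_splitter)
# Attempt 2: LlamaIndex's RssReader on the feed URL
documents = RssReader().load_data(['https://a2gov.legistar.com/Feed.ashx?M=CalendarDetail&ID=1062179&GUID=72A10A68-6E3E-4A2D-9C1B-15FF554DC60F&Title=City+of+Ann+Arbor+-+Meeting+of+City+Council+on+8%2f21%2f2023+at+7%3a00+PM'])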
1 comment
hmm, could be the text splitter 🤔

Do the documents look ok-ish once you run the loader?

print(documents[0].get_content())
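
If the chunks still look off, one thing to try is skipping the generic splitter and building one chunk per feed item, since the feed appears to be RSS (RssReader is already being pointed at it). Below is a minimal sketch using only the standard library. It assumes the feed has plain `<item>` elements with `<title>` and `<description>` children; the sample XML and the `rss_items_to_texts` helper are made up for illustration, so the real feed may need different field names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample standing in for the real feed; actual fields may differ.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample council meeting</title>
    <item>
      <title>CA-1</title>
      <description>Resolution to approve a contract</description>
    </item>
    <item>
      <title>CA-2</title>
      <description>Resolution to accept an easement</description>
    </item>
  </channel>
</rss>"""


def rss_items_to_texts(xml_text):
    """Return one text chunk per <item>, joining its title and description."""
    root = ET.fromstring(xml_text)
    texts = []
    for item in root.iter("item"):
        # findtext returns None when the child is missing, so fall back to "".
        title = (item.findtext("title") or "").strip()
        description = (item.findtext("description") or "").strip()
        texts.append(f"{title}\n{description}".strip())
    return texts


texts = rss_items_to_texts(SAMPLE_RSS)
for text in texts:
    print(repr(text))
```

Each string could then be wrapped in a Document for whichever framework you use, so chunk boundaries follow agenda items instead of token counts.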