Updated last year

Hello

Hello!

I'm using RssReader().load_data([url]) to load an XML file. Currently it creates one document per line item, and get_content() on each document returns only about 169 characters. Is this the optimal approach? I ask because I'm going to be loading a lot more XML files and expect them to be searchable (also, I want to change which metadata they automatically use).

Should I basically create my own loader and use my own node parser?
14 comments
Sounds like you might need to create your own reader 😅 It's a little hard to generalize a loader across all types of XML

Happy to help write one with you, but it shouldn't be too bad. Could use the existing RSS reader as a reference
https://github.com/emptycrown/llama-hub/blob/main/llama_hub/web/rss/base.py
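A rough sketch of what such a custom parser could look like, using only the standard library. The XML layout (`agenda`, `item`, `title`, `description`, the `date` and `body` attributes) and the function name `parse_agenda` are made up for illustration, not taken from the RSS reader. Each returned record could then be wrapped in a llama_index Document, though the exact constructor arguments depend on the library version.

```python
import xml.etree.ElementTree as ET

# Hypothetical agenda layout; real files will need different tag names.
SAMPLE_XML = """<agenda date="2023-05-01" body="City Council">
  <item id="1"><title>Road construction contract</title><description>Approve paving bid.</description></item>
  <item id="2"><title>Library hours</title><description>Extend weekend hours.</description></item>
</agenda>"""


def parse_agenda(xml_text, one_doc_per_agenda=True):
    """Return a list of {"text": ..., "metadata": ...} records.

    With one_doc_per_agenda=True, all line items are merged into a single
    larger record instead of one tiny record per item.
    """
    root = ET.fromstring(xml_text)
    # Metadata is chosen explicitly here rather than inherited from a reader.
    shared_meta = {"date": root.get("date"), "body": root.get("body")}

    items = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        desc = (item.findtext("description") or "").strip()
        items.append((item.get("id"), f"{title}: {desc}"))

    if one_doc_per_agenda:
        text = "\n".join(t for _, t in items)
        return [{"text": text, "metadata": {**shared_meta, "item_count": len(items)}}]
    return [{"text": t, "metadata": {**shared_meta, "item_id": i}} for i, t in items]


records = parse_agenda(SAMPLE_XML)
```

Grouping by agenda keeps related line items together in one chunk, while the per-item mode keeps the original granularity but with metadata you control.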
ok that's what i thought.
so @Logan M if each document is a line item in agenda
and I have 1000 agendas...
and I want to ask a question like "What items are most likely to be in the next meeting" or something or "Give me all items that involve construction"
A vector db is the right tool, but similarity_top_k = 5 is not going to be helpful
because there's 1000's of line items?
Hmm the first question seems like a time based thing? I.e. Given the last X agendas, what's next?

The second kind of points to something I was mentioning earlier -- extracting some kind of schema across your data beforehand, to enable something like text2sql

Alternatively, you could implement a keyword search across your documents, that an agent could decide to use? 😅 or something like that
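A minimal sketch of the keyword search idea, assuming each line item is a plain dict with a `text` field. The function name `keyword_search` and the sample records are hypothetical. Unlike a top-k vector lookup, this returns every matching item, which is what a question like "give me all items that involve construction" needs.

```python
import re


def keyword_search(records, keyword):
    """Return every record whose text contains the keyword as a whole word."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return [r for r in records if pattern.search(r["text"])]


# Made-up line items standing in for parsed agenda documents.
agenda_items = [
    {"text": "Road construction contract: approve paving bid", "agenda": "2023-05-01"},
    {"text": "Library hours: extend weekend hours", "agenda": "2023-05-01"},
    {"text": "Permit for bridge CONSTRUCTION staging area", "agenda": "2023-06-05"},
]

hits = keyword_search(agenda_items, "construction")
```

A function like this could be exposed to an agent as a tool, so the agent decides when exact matching is a better fit than similarity search.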
by text2sql, do you mean converting the documents and inserting them into SQL
so we can run actual queries on them?
Yea, so like for example defining some kind of structured schema, using a pydantic program or similar to extract that schema, and then inserting into a db

Tbh it's a feature I've been wanting to add to the library at some point, it feels powerful lol
yeah..... interesting but also in my case if it's xml I could just insert the agenda items right into sql
and then maybe use openai to add keywords about the industry the line items pertain to
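A sketch of that direct-to-SQL route, using the standard library's sqlite3 and an in-memory database. The table name `agenda_items`, its columns, and the XML layout are assumptions for illustration. The `keywords` column is left empty here as a placeholder for the LLM-generated industry keywords mentioned above.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical agenda layout; adjust tag names to the real files.
SAMPLE_XML = """<agenda date="2023-05-01">
  <item><title>Road construction contract</title></item>
  <item><title>Library hours</title></item>
  <item><title>Construction permit for bridge</title></item>
</agenda>"""


def load_items(conn, xml_text):
    """Insert one row per agenda line item and return the number inserted."""
    root = ET.fromstring(xml_text)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS agenda_items "
        "(agenda_date TEXT, title TEXT, keywords TEXT)"
    )
    rows = [
        # keywords stays NULL until an enrichment step fills it in.
        (root.get("date"), (item.findtext("title") or "").strip(), None)
        for item in root.iter("item")
    ]
    conn.executemany("INSERT INTO agenda_items VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
inserted = load_items(conn, SAMPLE_XML)

# LIKE is case-insensitive for ASCII text in SQLite by default.
matches = conn.execute(
    "SELECT title FROM agenda_items WHERE title LIKE ? ORDER BY title",
    ("%construction%",),
).fetchall()
```

Once the rows are in a table, "all items that involve construction" becomes an ordinary query that returns every match rather than the five nearest neighbours.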