Hi all 👋 I'm a full-time developer and a new user of LlamaIndex! Looking forward to getting to grips with this tool and building some cool products with it!
I have two main initial questions...
Context: I'm using Scrapy to scrape online articles. The articles are heavily nested: each one has a contents page, and each link on that page leads either to another contents page (a further set of links) or, where the rabbit hole ends, to some article text. The text is usually around 1000 characters, but is sometimes much shorter (around 300 characters) and sometimes much longer (like a full-blown blog post). These pieces of text also reference each other through href links in the HTML. I'm storing each piece as raw HTML in an RDS table with id, parent_id and html columns (so the first contents page has NULL for both the html column and the parent_id, and so on).
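To make the structure concrete, here is a minimal sketch of how the rows could be turned back into a hierarchy before indexing. The sample rows, `build_children` and `leaf_paths` are hypothetical names made up for illustration, and the sketch assumes every contents page has NULL (`None`) in the html column, not just the first one.

```python
from collections import defaultdict

# Hypothetical rows shaped like the table described above: (id, parent_id, html).
# Assumption: contents pages have html = None; leaf rows carry the article HTML.
ROWS = [
    (1, None, None),                      # top-level contents page
    (2, 1, None),                         # nested contents page
    (3, 1, "<p>Intro text</p>"),          # article directly under the root
    (4, 2, "<p>Nested article</p>"),      # article under the nested contents page
]


def build_children(rows):
    """Map each parent_id to the ids of its direct children."""
    children = defaultdict(list)
    for row_id, parent_id, _html in rows:
        children[parent_id].append(row_id)
    return children


def leaf_paths(rows):
    """Return {leaf_id: [root_id, ..., leaf_id]} for every row that has HTML."""
    parent_of = {row_id: parent_id for row_id, parent_id, _html in rows}
    paths = {}
    for row_id, _parent_id, html in rows:
        if html is None:
            continue  # contents pages hold no text of their own
        path = []
        node = row_id
        while node is not None:  # walk up until we pass the root
            path.append(node)
            node = parent_of[node]
        paths[row_id] = list(reversed(path))
    return paths
```

The path from root to leaf could then be attached to each document as metadata, so the hierarchy is not lost whichever index type ends up being used.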
Q1: What's the best method of indexing this data in terms of speed, cost and quality? As it follows a hierarchical structure, I feel like the TreeIndex makes the most sense, but I'm not 100% sure. The information here
https://gpt-index.readthedocs.io/en/latest/how_to/index_structs/composability.html looks useful, but the query_configs are confusing me a little. Does anyone have any initial thoughts about this? I'm more than happy to give more info and context!
Q2: Are there any benefits to passing the raw HTML into LlamaIndex, or should I strip the text out (or do something else)? I know unstructured.io has a parser for HTML files, but as I'm loading this HTML from the database I'm not sure it's a direct solution. The HTML includes links to other documents as well as list structures, so I don't want to lose any potentially useful information by converting it to text. Does anyone have any thoughts?
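For reference, this is roughly what I mean by stripping the text out without throwing away the links and lists. It is only a sketch using Python's standard-library `html.parser`; the class `TextAndLinks` and the function `html_to_text` are hypothetical names of my own, not part of LlamaIndex or unstructured.io, and real article HTML would likely need more tag handling than this.

```python
from html.parser import HTMLParser


class TextAndLinks(HTMLParser):
    """Collect visible text, keep list items as '- ' bullets, and record hrefs."""

    def __init__(self):
        super().__init__()
        self.parts = []   # text fragments in document order
        self.links = []   # href values of <a> tags, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "li":
            self.parts.append("\n- ")  # keep list structure as plain-text bullets

    def handle_endtag(self, tag):
        if tag in ("p", "ul", "ol"):
            self.parts.append("\n")  # block-level elements end a line

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return (plain_text, links) for an HTML string loaded from the database."""
    parser = TextAndLinks()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line), parser.links


sample = '<p>See <a href="/doc/2">part two</a>.</p><ul><li>One</li><li>Two</li></ul>'
text, links = html_to_text(sample)
```

The extracted links could be stored alongside the text as metadata, so the cross-references between articles survive even though the markup itself is gone.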
Thanks in advance, and I look forward to being part of this community!