
Hi all πŸ‘‹ I'm a full-time developer and a new user of LlamaIndex! Looking forward to getting to grips with this tool and building some cool products with it!


I have two main initial questions...
Context
I'm using Scrapy to scrape online articles. These articles' structures are heavily nested: each article has a contents page, and each link on that page takes you either to another set of links (another contents page), or the rabbit hole ends and some article text is returned. The text is usually around 1,000 characters, but sometimes much shorter (around 300 characters) and sometimes much longer (like a full-blown blog post). These bits of text also reference each other with <a href> links in the HTML. I'm storing these bits of article as raw HTML in an RDS table with id, parent_id, and html columns (so the first contents page has NULL for the html column and the parent_id, etc.).
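For anyone reading along, the hierarchy described above can be reassembled by walking the parent_id links. A minimal sketch, assuming a table named articles with the id / parent_id / html columns mentioned (all names are illustrative, shown here against an in-memory SQLite table rather than RDS):

```python
import sqlite3

def build_article_tree(conn: sqlite3.Connection, root_id: int) -> dict:
    """Recursively assemble one article's nested structure from a
    table with id / parent_id / html columns (names assumed)."""
    row = conn.execute(
        "SELECT id, html FROM articles WHERE id = ?", (root_id,)
    ).fetchone()
    children = conn.execute(
        "SELECT id FROM articles WHERE parent_id = ?", (root_id,)
    ).fetchall()
    return {
        "id": row[0],
        "html": row[1],  # NULL (None) for pure contents pages
        "children": [build_article_tree(conn, c[0]) for c in children],
    }

# Tiny in-memory table mirroring the described schema
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, parent_id INTEGER, html TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [(1, None, None), (2, 1, "<p>intro</p>"), (3, 1, "<p>details</p>")],
)
tree = build_article_tree(conn, 1)
print(len(tree["children"]))  # 2
```

A flattened walk of this tree is a natural way to produce one document per leaf before indexing.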

Q1
What's the best method of indexing this data in terms of speed, cost, and quality? As it follows a hierarchical structure I feel like the TreeIndex makes the most sense, but I'm not 100% sure. The information here https://gpt-index.readthedocs.io/en/latest/how_to/index_structs/composability.html looks useful, but the query_configs are confusing me a little. Does anyone have any initial thoughts about this? I'm more than happy to give more info and context!

Q2
I am interested in knowing whether there are any benefits to passing the raw HTML into LlamaIndex, or if I should strip the text out (or something else?). I know unstructured.io has a parser for HTML files, but as I'm loading this HTML from the database I'm not sure it's a direct solution. Links to other documents are included in the HTML, as well as list structures, so I don't want to lose any potentially useful information by converting it to text. Does anyone have any thoughts?

Thanks in advance! I look forward to being part of this community!
2 comments
Q1 -> I would start with just a vector index, tbh. It's the cheapest to build and query, and also the fastest. If it doesn't work well enough, then you could try a tree index πŸ‘ With vector indexes, I usually lower the chunk_size_limit (maybe 1024 or 512), and then query using index.query(..., similarity_top_k=3, response_mode="compact")

Graphs/Composable indexes are definitely another option. If the articles have clear categories or topics, you can group them into indexes (maybe a bunch of vector indexes) and then define a tree or vector index on top of that. But again, maybe a simple vector index will work well enough.

Q2 -> This might take some testing, but I think stripping out the text will give better results (especially for a vector index)
Awesome, thanks for that. I'm working on a tool for the startup I work at, so this will be an ongoing exercise over the coming weeks! I'll let you know how I get on with these bits and update you on progress and any other questions. Thanks a bunch!