Find answers from the community

maximilian
Offline, last seen 3 months ago
Joined September 25, 2024
Can I use the unstructured data loader directly with AWS S3?
2 comments
Specify the number of documents used by a retriever
2 comments
Hello people! Is it possible to build an index using LlamaIndex and then use it as (or convert it to) a vector store for use by LangChain? I know there are some parts that are backwards compatible, but I don't know if this is possible! Thanks in advance.
2 comments
Hi all 👋 I'm a full-time developer and a new user of LlamaIndex! Looking forward to getting to grips with this tool and building some cool products with it!


I have two main initial questions...
Context
I'm using Scrapy to scrape online articles. These articles' structures are heavily nested: each article has a contents page, and each link on that page takes you either to another set of links (another contents page), or the rabbit hole ends and some article text is returned. The text is usually around 1,000 characters, but sometimes much shorter (around 300 characters) and sometimes much longer (like a full-blown blog post). These pieces of text also reference each other with <a href> links in the HTML. I'm storing these article fragments as raw HTML in an RDS table with id, parent_id, and html columns (so the first contents page has NULL for the html column and the parent_id, etc.).
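For context, here's a minimal sketch of how I'm thinking about the hierarchy: rebuilding the article tree from (id, parent_id, html) rows as described above, and pulling out the leaf-node text. The tuple shape and NULL-as-None convention are assumptions about my own schema, not anything LlamaIndex-specific:

```python
# Sketch: rebuild the article hierarchy from (id, parent_id, html) rows,
# as fetched from the RDS table described above. The row shape is an
# assumption for illustration, not an actual API.

def build_tree(rows):
    """rows: iterable of (id, parent_id, html) tuples; parent_id is None for a root."""
    nodes = {rid: {"id": rid, "html": html, "children": []} for rid, _, html in rows}
    roots = []
    for rid, parent_id, _ in rows:
        if parent_id is None:
            roots.append(nodes[rid])
        else:
            nodes[parent_id]["children"].append(nodes[rid])
    return roots

def leaf_texts(node):
    """Yield the html of leaf nodes (the actual article text), depth-first."""
    if not node["children"]:
        if node["html"] is not None:
            yield node["html"]
    else:
        for child in node["children"]:
            yield from leaf_texts(child)
```

(Each leaf's text would then become a document/node for indexing, with the path from the root available as metadata.)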

Q1
What's the best method of indexing this data, in terms of speed, cost, and quality? As it follows a hierarchical structure, I feel like the TreeIndex makes the most sense, but I'm not 100% sure. The information here https://gpt-index.readthedocs.io/en/latest/how_to/index_structs/composability.html looks useful, however the query_configs are confusing me a little. Does anyone have any initial thoughts about this? I'm more than happy to give more info and context!

Q2
I am interested in knowing whether there are any benefits to passing the raw HTML into LlamaIndex, or whether I should strip the text out (or something else?). I know unstructured.io has a parser for HTML files, but as I'm loading this HTML from the database, I'm not sure it's a direct solution. Links to other documents are included in the HTML, as well as list structures, so I don't want to lose any potentially useful information by converting it to plain text. Does anyone have any thoughts?
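One middle ground I've been considering (just a sketch using Python's stdlib html.parser, not a definitive answer): extract the visible text while keeping the href targets on the side, so the link information isn't lost when stripping tags:

```python
from html.parser import HTMLParser

class TextAndLinks(HTMLParser):
    """Collect visible text plus the href targets of any <a> tags,
    so link information survives the HTML-to-text conversion."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def strip_html(html):
    """Return (plain_text, list_of_hrefs) for an HTML fragment."""
    parser = TextAndLinks()
    parser.feed(html)
    return " ".join(parser.parts), parser.links
```

The extracted links could then go into the document's metadata rather than its text.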

Thanks in advance! I look forward to being part of this community!
2 comments
@kapa.ai explain the semantic search and hybrid search
5 comments
@kapa.ai what are the possible kwargs for as_retriever?
5 comments
Was there any answer to this? I have a similar problem!
4 comments
I'm about to spend the next few hours of my Friday eve working this out. I shall not rest. (I'll have to set it all up with LangChain and then try to integrate the LlamaIndex, right?)
6 comments
Hi! I'm really looking to pick someone's brain about indexing an online document that is structured like the picture attached (but much larger, lol!). All the 'leaf node' files are extremely nested (some underneath 4 layers of sections and others on the top level), and each one varies in length (some are short paragraphs, others are large pieces of text). I've tried concatenating the text from all the HTML files into one document with limited success, and I've also tried treating each 'leaf node' as its own document.

One thing I do want to be able to do is reference which section was used as context for the answer, so the user can follow a link to the relevant docs.

Has anyone dealt with a file structure like this? Any suggestions on a method or things to try?

NB: to add to the complexity, I also have a number of these main documents ('collections') I want to build the index over, and I also need to build an index over a few of these collections!

fun fun fun!

Thanks in advance for any help or guidance!
2 comments