Updated 6 months ago
And what is the order in which nodes are
dean
6 months ago
And what is the order in which nodes are returned from the parser for a single document?
Logan M
6 months ago
I would take a look at the source code for most of these questions
The arguments aren't a list of files, but a list of Document objects containing HTML text
Any metadata on these Document objects would be inherited by the nodes created
Logan M
6 months ago
The node id is just a UUID, but it can really be anything (although some vector DBs require it to be a UUID)
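To illustrate the point about node ids, here is a minimal sketch using only the Python standard library. The namespace URL and the helper names `random_node_id` and `stable_node_id` are hypothetical, chosen for illustration; they are not part of LlamaIndex. The idea is that a deterministic UUID derived from the source path and position keeps ids stable across runs while still being a valid UUID for stores that insist on one.

```python
import uuid

# Hypothetical project namespace; any fixed UUID would do here.
NODE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/docs")


def random_node_id() -> str:
    """Return a random UUID4 string, different on every call."""
    return str(uuid.uuid4())


def stable_node_id(file_path: str, position: int) -> str:
    """Return a deterministic UUID5 built from the source path and node position."""
    return str(uuid.uuid5(NODE_NAMESPACE, f"{file_path}#{position}"))


if __name__ == "__main__":
    print(random_node_id())
    print(stable_node_id("guide/intro.html", 0))
```

Whether a given vector DB accepts non-UUID ids depends on the store, so checking its documentation before picking a scheme seems worthwhile.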
dean
6 months ago
Yes, well, those docs are from SimpleDirectoryReader, so does it have a means to attach a prefix to the file path and store it in the documents?
dean
6 months ago
Store a URL prefix in the document metadata?
dean
6 months ago
Is there any way to do this?
Logan M
6 months ago
If the documents are from SimpleDirectoryReader, the HTML will already be parsed, so there is no need to use the HTML node parser (this node parser is intended for raw HTML text)
SimpleDirectoryReader will attach the file path in the metadata
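For the URL prefix question, one possible approach is a small function that turns a file path into metadata containing a URL. This is a sketch under assumptions: `BASE_URL`, `DOCS_ROOT`, and `url_metadata` are hypothetical names, and the code below uses only the standard library. As far as I know, SimpleDirectoryReader accepts a `file_metadata` callable that receives the file path and returns a dict, but verify that against your installed version before relying on it.

```python
from pathlib import Path

BASE_URL = "https://example.com/docs"  # hypothetical site prefix
DOCS_ROOT = Path("site")               # hypothetical local folder holding the HTML files


def url_metadata(file_path: str) -> dict:
    """Map a local file path to metadata that includes its public URL."""
    path = Path(file_path)
    try:
        # Keep the folder structure below DOCS_ROOT in the URL.
        relative = path.relative_to(DOCS_ROOT)
    except ValueError:
        # File is outside DOCS_ROOT: fall back to the bare file name.
        relative = Path(path.name)
    return {
        "file_path": str(path),
        "url": f"{BASE_URL}/{relative.as_posix()}",
    }


if __name__ == "__main__":
    print(url_metadata("site/guide/intro.html"))
    # Assumed usage, not executed here:
    # SimpleDirectoryReader("site", file_metadata=url_metadata).load_data()
```

Since metadata on a Document is inherited by the nodes created from it, a `url` key set this way should be available on each node as well.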
dean
6 months ago
I don't know, it seems to be finding HTML tags in the files
dean
6 months ago
Plus it allows a per-tag inclusion list, which is quite useful
dean
6 months ago
Basically, I am able to parse out nodes for just what I need
dean
6 months ago
In some cases I would like to examine not just whether the tag is h2, p, or div, but also the attributes of the tag
dean
6 months ago
I don't think I see that available. Maybe there is a setting for including attribute metadata?
Logan M
6 months ago
I don't think that's available at the moment
Logan M
6 months ago
I would recommend just writing your own parsing algorithm if you have specific needs tbh
Logan M
6 months ago
It's nothing too crazy
dean
6 months ago
Well, I think that's the only thing missing, actually
dean
6 months ago
I am writing my own algorithm, but on top of HTMLNodeParser. I don't want to reinvent the wheel
dean
6 months ago
Attributes would be all that's needed
Logan M
6 months ago
The code is pretty straightforward
https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/node_parser/file/html.py
PRs are welcome
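As a rough idea of what a custom parser that keeps tag attributes could look like, here is a sketch built on Python's standard-library `html.parser`. This is not how the linked HTMLNodeParser is implemented; it is a swapped-in, stdlib-only illustration. `TagAttrCollector` and `extract_chunks` are hypothetical names, and the code assumes reasonably well-formed HTML where wanted tags are explicitly closed.

```python
from html.parser import HTMLParser


class TagAttrCollector(HTMLParser):
    """Collect text from selected tags together with each tag's attributes."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.stack = []   # currently open wanted tags: (tag, attrs, text_parts)
        self.chunks = []  # finished chunks, in the order their tags close

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.stack.append((tag, dict(attrs), []))

    def handle_data(self, data):
        # Text is credited to the innermost open wanted tag only.
        if self.stack:
            self.stack[-1][2].append(data)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            tag, attrs, parts = self.stack.pop()
            text = " ".join("".join(parts).split())  # normalize whitespace
            if text:
                self.chunks.append({"tag": tag, "attrs": attrs, "text": text})


def extract_chunks(html, tags, attr_filter=None):
    """Return chunks for the given tags, optionally filtered by attributes."""
    collector = TagAttrCollector(tags)
    collector.feed(html)
    collector.close()
    if attr_filter is None:
        return collector.chunks
    return [c for c in collector.chunks if attr_filter(c["tag"], c["attrs"])]


if __name__ == "__main__":
    sample = '<h2 id="setup">Setup</h2><p class="note">Install it.</p><p>Plain text.</p>'
    for chunk in extract_chunks(sample, ["h2", "p"]):
        print(chunk)
```

Each chunk's `attrs` dict could then be copied into node metadata, which would cover the attribute use case raised earlier in the thread. Text inside a nested wanted tag is credited to the inner tag only, so the grouping may differ from what HTMLNodeParser produces.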
dean
6 months ago
I will take a look, thanks