
Updated 6 months ago

And what is the order in which nodes are returned from the parser for a single document?
19 comments
I would take a look at the source code for most of these questions

The arguments aren't a list of files, but a list of Document objects containing HTML text

Any metadata on these document objects would be inherited to the nodes created
The node id is just a UUID, but it can really be anything (although some vector DBs require it to be a UUID)
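To make the reply above concrete, here is a minimal plain-Python model of the behaviour it describes: nodes copy their parent document's metadata, and each node gets a UUID string as its id. The `Doc`, `Node`, and `split_into_nodes` names are hypothetical stand-ins for illustration only, not the library's classes.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document holding raw HTML text."""
    text: str
    metadata: dict = field(default_factory=dict)


@dataclass
class Node:
    """Hypothetical stand-in for a parsed node."""
    text: str
    metadata: dict
    # Default id is a random UUID string; any unique string would also work here.
    node_id: str = field(default_factory=lambda: str(uuid.uuid4()))


def split_into_nodes(doc: Doc, pieces: list[str]) -> list[Node]:
    # Each node gets its own copy of the parent's metadata.
    return [Node(text=p, metadata=dict(doc.metadata)) for p in pieces]


doc = Doc("<h2>Title</h2><p>Body</p>", {"file_path": "site/index.html"})
nodes = split_into_nodes(doc, ["Title", "Body"])
```

Because the metadata is copied rather than shared, changing one node's metadata afterwards does not affect its siblings or the document.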
Yes, well, those docs are from SimpleDirectoryReader, so does it have a means to attach a prefix to the file path and store it in the documents?
Store url prefix in document metadata?
Is there any way to do this?
if the documents are from simple directory reader, the html will already be parsed, no need to use the html node parser (this node parser is intended for raw html text)

simple directory reader will attach the file path in the metadata
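One way to get a URL into the metadata is to derive it from the file path. The sketch below is stdlib-only; `make_url_metadata`, the `site` directory, and the `https://example.com/docs` prefix are all made-up names for illustration. As far as I know, SimpleDirectoryReader accepts a `file_metadata` callable that maps a file path to a metadata dict, so a function like this could be passed there, but treat that as an assumption and check the source.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def make_url_metadata(root_dir: str, url_prefix: str):
    """Return a function mapping a file path to a metadata dict with a source URL."""
    root = PurePosixPath(root_dir)

    def file_metadata(file_path: str) -> dict:
        # Path of the file relative to the crawl root, e.g. "guide/intro.html".
        rel = PurePosixPath(file_path).relative_to(root)
        # Percent-encode spaces and similar characters, keeping "/" as-is.
        url = url_prefix.rstrip("/") + "/" + quote(rel.as_posix())
        return {"file_path": file_path, "url": url}

    return file_metadata


to_meta = make_url_metadata("site", "https://example.com/docs")
meta = to_meta("site/guide/intro.html")
print(meta["url"])  # https://example.com/docs/guide/intro.html
```

This assumes POSIX-style paths that all live under `root_dir`; `relative_to` raises `ValueError` for a path outside it.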
I don't know, it seems to be finding HTML tags in the files
Plus it allows per tag inclusion list which is quite useful
I am able to parse out nodes just for what I need basically
In some cases I would like to examine not just whether the tag is h2, p, or div, but also the attributes of the tag
I don't see that available; maybe there is a setting for including attribute metadata?
I don't think that's available at the moment
I would recommend just writing your own parsing algorithm if you have specific needs tbh
It's nothing too crazy
Well I think that's the only thing missing actually
I am writing my own algorithm, but on top of HTMLNodeParser; I don't want to reinvent the wheel
Attributes would be all that's needed
I will take a look thanks
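For anyone landing here with the same need, a custom parser that keeps tag attributes can be fairly small. This is a rough sketch using Python's standard `html.parser` module rather than HTMLNodeParser; the `TagTextExtractor` class and the sample HTML are made up for illustration. It collects the text of selected tags along with each tag's attributes, which could then be copied into node metadata.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect text for selected tags together with each tag's attributes."""

    def __init__(self, tags=("h2", "p", "div")):
        super().__init__()
        self.tags = set(tags)
        self.stack = []   # currently open tracked tags: (tag, attrs, text_parts)
        self.chunks = []  # finished chunks, in the order their tags close

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.stack.append((tag, dict(attrs), []))

    def handle_data(self, data):
        # Text is credited to the innermost open tracked tag only.
        if self.stack:
            self.stack[-1][2].append(data)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            name, attrs, parts = self.stack.pop()
            text = " ".join("".join(parts).split())  # collapse whitespace
            if text:
                self.chunks.append({"tag": name, "attrs": attrs, "text": text})


html = (
    '<h2 class="title" id="intro">Intro</h2>'
    '<p data-kind="note">Some <b>bold</b> text.</p>'
    "<span>skipped</span>"
)
parser = TagTextExtractor()
parser.feed(html)
parser.close()

# Filter on attributes as well as tag name.
titles = [c for c in parser.chunks if c["attrs"].get("class") == "title"]
```

Limitations of this sketch: it assumes well-nested HTML (an unclosed tracked tag is never emitted), and text inside nested tracked tags goes to the inner tag rather than being duplicated in the outer one.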