
Updated 10 months ago

And what is the order in which nodes are returned from the parser for a single document?

At a glance

The post asks about the order in which nodes are returned from the parser for a single document. The comments discuss various aspects of parsing HTML documents, including:

- The arguments to the parser are a list of Document objects containing HTML text, and any metadata on these objects would be inherited by the nodes created.

- The node ID can be anything, although some databases require it to be a UUID.

- If the documents are from the SimpleDirectoryReader, the HTML is already parsed, and the file path is attached to the metadata.

- Community members discuss the ability to parse specific HTML tags and their attributes, and whether this functionality is currently available. Some suggest writing a custom parsing algorithm if specific needs are not met by the existing tools.

There is no explicitly marked answer in the comments.

And what is the order in which nodes are returned from the parser for a single document?
19 comments
I would take a look at the source code for most of these questions

The arguments aren't a list of files, but a list of Document objects containing HTML text

Any metadata on these Document objects would be inherited by the nodes created
The node ID is just a UUID by default, but it can really be anything (although some vector DBs require it to be a UUID)
Yes, well, those docs are from SimpleDirectoryReader, so does it have a means to attach a prefix to the file path and store it in the documents?
Store a URL prefix in the document metadata?
Is there any way to do this?
If the documents are from SimpleDirectoryReader, the HTML will already be parsed, so there's no need to use the HTML node parser (this node parser is intended for raw HTML text)

SimpleDirectoryReader will attach the file path in the metadata
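For the URL prefix question above, here is a minimal sketch of one way to turn a local file path into metadata that carries a public URL. It uses only the standard library. `URL_PREFIX`, `LOCAL_ROOT`, and `url_metadata` are hypothetical names made up for illustration, not part of any library.

```python
from pathlib import PurePosixPath
from urllib.parse import quote

# Hypothetical values for illustration only; substitute your own.
URL_PREFIX = "https://example.com/docs/"
LOCAL_ROOT = "/data/site"


def url_metadata(file_path: str) -> dict:
    """Build a metadata dict holding the original path and a derived URL."""
    # Path of the file relative to the local root, e.g. "guide/intro.html".
    relative = PurePosixPath(file_path).relative_to(LOCAL_ROOT)
    return {
        "file_path": file_path,
        # quote() escapes spaces and similar characters but keeps "/" as-is.
        "url": URL_PREFIX + quote(relative.as_posix()),
    }
```

As far as I know, SimpleDirectoryReader accepts a `file_metadata` callable that takes a file path and returns a dict, so a function like this could be passed there. Check the source for your version before relying on that.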
I don't know, it seems to be finding HTML tags in the files
Plus it allows a per-tag inclusion list, which is quite useful
I am able to parse out nodes for just what I need, basically
In some cases I would like to examine not just whether the tag is h2, p, or div, but also the attributes of the tag
I don't see that available. Maybe there is a setting for including attribute metadata?
I don't think that's available at the moment
I would recommend just writing your own parsing algorithm if you have specific needs, tbh
It's nothing too crazy
Well, I think that's the only thing missing, actually
I am doing my own algorithm, but on top of HTMLNodeParser, since I don't want to reinvent the wheel
Attributes would be all that's needed
I will take a look, thanks
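For anyone landing here later, the "write your own parsing algorithm" suggestion could look roughly like the sketch below. It is not HTMLNodeParser and does not reproduce its behavior. It is a standalone example built on Python's standard `html.parser`, and `TagChunker` and `parse_chunks` are hypothetical names. It keeps a per-tag inclusion list, records each tag's attributes, and returns chunks in the order their start tags appear in the document.

```python
from html.parser import HTMLParser


class TagChunker(HTMLParser):
    """Collect text for selected tags and keep each tag's attributes."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.chunks = []  # every wanted tag seen, in start-tag order
        self._open = []   # stack of wanted tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            chunk = {"tag": tag, "attrs": dict(attrs), "text": ""}
            self.chunks.append(chunk)
            self._open.append(chunk)

    def handle_endtag(self, tag):
        # Close the innermost open wanted tag with this name, if any.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i]["tag"] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Text is credited to the innermost open wanted tag only.
        if self._open:
            self._open[-1]["text"] += data


def parse_chunks(html, tags=("h2", "p", "div")):
    """Return non-empty chunks with whitespace-normalized text."""
    parser = TagChunker(tags)
    parser.feed(html)
    parser.close()
    return [
        {**c, "text": " ".join(c["text"].split())}
        for c in parser.chunks
        if c["text"].strip()
    ]
```

Each chunk's `attrs` dict could then be copied into the metadata of whatever node object you build from it, and filtering on something like `chunk["attrs"].get("class")` becomes a one-liner.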