And what is the order in which nodes are returned from the parser for a single document?

At a glance

The post asks about the order in which nodes are returned from the parser for a single document. The comments discuss various aspects of parsing HTML documents, including:

- The arguments to the parser are a list of Document objects containing HTML text, and any metadata on these objects would be inherited by the nodes created.

- The node ID can be anything, although some databases require it to be a UUID.

- If the documents are from the SimpleDirectoryReader, the HTML is already parsed, and the file path is attached to the metadata.

- Community members discuss the ability to parse specific HTML tags and their attributes, and whether this functionality is currently available. Some suggest writing a custom parsing algorithm if specific needs are not met by the existing tools.

There is no explicitly marked answer in the comments.

ddean

And what is the order in which nodes are returned from the parser for a single document?

19 comments

Logan M

I would take a look at the source code for most of these questions

The arguments aren't a list of files, but a list of Document objects containing HTML text

Any metadata on these Document objects would be inherited by the nodes created

Logan M

The node id is just a UUID, but it can really be anything (although some vector dbs require it to be a UUID)
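
To illustrate the point about node ids: if a vector db insists on UUIDs but you want ids derived from your own strings, one option is a deterministic UUID. This is only a sketch using the Python standard library; `NODE_NAMESPACE`, `random_node_id`, and `stable_node_id` are made-up names for illustration, not part of any library.

```python
import uuid

# Hypothetical namespace for this illustration; any fixed UUID would do.
NODE_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/nodes")


def random_node_id() -> str:
    """Return a fresh random UUID string, different on every call."""
    return str(uuid.uuid4())


def stable_node_id(custom_id: str) -> str:
    """Map an arbitrary string id to a UUID that is the same on every run."""
    return str(uuid.uuid5(NODE_NAMESPACE, custom_id))
```

The deterministic variant is handy when the same document is re-ingested, since the same input string always maps to the same id.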

ddean

Yes, well, those docs are from SimpleDirectoryReader, so does it have a means to attach a prefix to the file path and store it in the documents?

ddean

Store url prefix in document metadata?

ddean

Is there any way to do this?

Logan M

If the documents are from SimpleDirectoryReader, the HTML will already be parsed, so there's no need to use the HTML node parser (this node parser is intended for raw HTML text)

SimpleDirectoryReader will attach the file path in the metadata
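
Since the file path ends up in the metadata, a URL can be derived from it. Below is a minimal sketch of that idea, assuming the files on disk mirror the site layout; `url_metadata`, the `site` directory, and the `https://example.com/docs/` prefix are all placeholders invented for this example.

```python
import os


def url_metadata(
    file_path: str,
    input_dir: str = "site",                         # hypothetical input folder
    url_prefix: str = "https://example.com/docs/",   # hypothetical URL prefix
) -> dict:
    """Build a metadata dict holding the file path and a URL derived from it."""
    # Path of the file relative to the input folder, with forward slashes.
    relative = os.path.relpath(file_path, input_dir).replace(os.sep, "/")
    return {
        "file_path": file_path,
        "url": url_prefix.rstrip("/") + "/" + relative,
    }
```

One way to apply it is to loop over the loaded documents and update each document's `metadata` dict with the result for its `file_path`. As far as I know, SimpleDirectoryReader also accepts a `file_metadata` callable that takes a file path and returns a dict, but check that against the version you have installed.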

ddean

I don't know, it seems to be finding HTML tags in the files

ddean

Plus it allows a per-tag inclusion list, which is quite useful

ddean

I am able to parse out nodes just for what I need basically

ddean

In some cases I would like to examine not just whether the tag is h2, p, or div, but also the attributes of the tag

ddean

I don't think I see that available. Maybe there is a setting for including attribute metadata?

Logan M

I don't think that's available at the moment
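
If attribute-aware parsing is needed in the meantime, a small custom parser is one way to go. The following is a rough sketch built on Python's standard `html.parser`, not the library's HTML node parser; `TagAttributeParser` and `parse_html` are hypothetical names, and the plain dicts stand in for real node objects. It assumes reasonably well-formed HTML where every included tag is explicitly closed.

```python
from html.parser import HTMLParser


class TagAttributeParser(HTMLParser):
    """Collect text for selected tags and keep each tag's attributes."""

    def __init__(self, tags, base_metadata=None):
        super().__init__()
        self.tags = set(tags)                         # per-tag inclusion list
        self.base_metadata = dict(base_metadata or {})
        self.nodes = []    # finished nodes, in the order their tags close
        self._stack = []   # included elements that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self._stack.append({"tag": tag, "attrs": dict(attrs), "parts": []})

    def handle_data(self, data):
        # Text goes to the innermost open element that is on the inclusion list.
        if self._stack:
            self._stack[-1]["parts"].append(data)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1]["tag"] == tag:
            element = self._stack.pop()
            text = " ".join("".join(element["parts"]).split())
            if text:
                metadata = {
                    **self.base_metadata,   # inherited document metadata
                    "tag": tag,
                    "attrs": element["attrs"],
                }
                self.nodes.append({"text": text, "metadata": metadata})


def parse_html(html, tags, base_metadata=None):
    """Return a list of {'text', 'metadata'} dicts for the given tags."""
    parser = TagAttributeParser(tags, base_metadata)
    parser.feed(html)
    parser.close()
    return parser.nodes
```

With the attributes stored in the metadata, filtering becomes an ordinary list comprehension, for example keeping only nodes whose `attrs` dict has a particular `class` value.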