The post asks about the order in which nodes are returned from the parser for a single document. The comments discuss various aspects of parsing HTML documents, including:
- The arguments to the parser are a list of Document objects containing HTML text, and any metadata on these objects would be inherited by the nodes created.
- The node ID can be anything, although some databases require it to be a UUID.
- If the documents are from the SimpleDirectoryReader, the HTML is already parsed, and the file path is attached to the metadata.
- Community members discuss the ability to parse specific HTML tags and their attributes, and whether this functionality is currently available. Some suggest writing a custom parsing algorithm if specific needs are not met by the existing tools.
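The custom-parsing suggestion, together with the points above about inherited metadata and UUID node IDs, could be sketched roughly as follows. This is a minimal illustration using only Python's standard library, not the library's own parser: `TagCollector`, `html_to_nodes`, and the dictionary layout of each node are hypothetical names chosen for the example. It assumes raw HTML text as input and only handles non-void tags (ones with a closing tag).

```python
import uuid
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: collects text and attributes for chosen tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = set(tags)
        self.stack = []  # currently open matching tags: (tag, attrs, text_parts)
        self.found = []  # completed entries, in the order their closing tag is seen

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.stack.append((tag, dict(attrs), []))

    def handle_data(self, data):
        # Text is credited to the innermost open tag we are tracking.
        if self.stack:
            self.stack[-1][2].append(data)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1][0] == tag:
            name, attrs, parts = self.stack.pop()
            text = " ".join("".join(parts).split())  # collapse whitespace
            self.found.append({"tag": name, "attrs": attrs, "text": text})


def html_to_nodes(html_text, doc_metadata, tags=("h1", "p")):
    """Turn raw HTML into plain-dict 'nodes' that inherit the document metadata."""
    collector = TagCollector(tags)
    collector.feed(html_text)
    collector.close()
    nodes = []
    for item in collector.found:
        nodes.append({
            "id": str(uuid.uuid4()),  # a UUID, since some databases require one
            "text": item["text"],
            # Document metadata is copied onto every node, then extended.
            "metadata": {**doc_metadata, "tag": item["tag"], "attrs": item["attrs"]},
        })
    return nodes


sample_html = '<h1 id="top">Title</h1><p class="intro">Hello <b>world</b></p>'
sample_nodes = html_to_nodes(sample_html, {"file_path": "example.html"})
for node in sample_nodes:
    print(node["metadata"]["tag"], node["metadata"]["attrs"], node["text"])
```

For flat (non-nested) tags, the nodes come back in document order. With nested tracked tags, an inner tag is emitted before its parent because entries are recorded when the closing tag is seen, so the ordering would need adjusting if that matters for your use case.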
There is no explicitly marked answer in the comments.
If the documents are from SimpleDirectoryReader, the HTML will already be parsed, so there is no need to use the HTML node parser (this node parser is intended for raw HTML text).
SimpleDirectoryReader will attach the file path in the metadata.