Hello, I have a question about loading HTML files. I'm following the tutorial here (https://github.com/jerryjliu/llama_index/blob/main/examples/chatbot/Chatbot_SEC.ipynb), but with my own HTML file. For some HTML files, I get this error:
INFO:unstructured:Reading document from string ...
INFO:unstructured:Reading document ...
Traceback (most recent call last):
File "/Users/user/crawl/index.py", line 14, in <module>
html = loader.load_data(file=Path(f'./output1.html'))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/crawl/venv/lib/python3.11/site-packages/llama_index/readers/llamahub_modules/file/unstructured/base.py", line 36, in load_data
elements = partition(str(file))
^^^^^^^^^^^^^^^^^^^^
File "/Users/user/crawl/venv/lib/python3.11/site-packages/unstructured/partition/auto.py", line 86, in partition
elements = partition_html(
^^^^^^^^^^^^^^^
File "/Users/user/crawl/venv/lib/python3.11/site-packages/unstructured/partition/html.py", line 85, in partition_html
layout_elements = document_to_element_list(document, include_page_breaks=include_page_breaks)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/user/crawl/venv/lib/python3.11/site-packages/unstructured/partition/common.py", line 71, in document_to_element_list
num_pages = len(document.pages)
^^^^^^^^^^^^^^
File "/Users/user/crawl/venv/lib/python3.11/site-packages/unstructured/documents/xml.py", line 52, in pages
self._pages = self._read()
^^^^^^^^^^^^
File "/Users/user/crawl/venv/lib/python3.11/site-packages/unstructured/documents/html.py", line 101, in _read
etree.strip_elements(self.document_tree, ["script"])
File "src/lxml/cleanup.pxi", line 100, in lxml.etree.strip_elements
File "src/lxml/apihelpers.pxi", line 41, in lxml.etree._documentOrRaise
TypeError: Invalid input object: NoneType
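
One possible way to narrow this down (a sketch, not a confirmed diagnosis): the last frames show `self.document_tree` being `None` when `etree.strip_elements` is called, which suggests the parser produced no tree for that particular input. Assuming that happens for files that are empty or contain no actual elements, a standard-library check like the one below could flag such files before they are passed to the loader. The helper names `has_html_elements` and `find_suspect_files` are hypothetical and not part of llama_index or unstructured.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TagCounter(HTMLParser):
    """Counts the start tags seen while parsing (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tag_count = 0

    def handle_starttag(self, tag, attrs):
        self.tag_count += 1


def has_html_elements(text: str) -> bool:
    """Return True if the text contains at least one HTML start tag."""
    counter = _TagCounter()
    counter.feed(text)
    counter.close()
    return counter.tag_count > 0


def find_suspect_files(paths):
    """Return the paths whose content has no HTML elements at all.

    Assumption: such files are the ones that leave the parser without a
    document tree. This is a guess based on the traceback, not verified.
    """
    suspects = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="replace")
        if not has_html_elements(text):
            suspects.append(str(path))
    return suspects
```

For example, `find_suspect_files(Path(".").glob("*.html"))` would list the files in the current directory that contain no tags. If `output1.html` is not among them, the cause is probably something else, such as an encoding problem or a version mismatch between unstructured and lxml.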