The community member is using UnstructuredElementNodeParser to split documents into text nodes and index nodes, which works well with HTML documents. However, they are facing an issue where the parser fails to split tables into index nodes when working with PDF documents. The comments suggest that PDF parsing can be challenging, and the UnstructuredElementNodeParser may not be properly identifying the tables in the PDF. A community member suggests trying a tool like Camelot to extract the tables, but notes that if UnstructuredElementNodeParser is failing, Camelot might also have difficulties.
Hi Team, I was using UnstructuredElementNodeParser to split documents into text nodes and index nodes. This works really well with HTML documents (after loading them using FlatReader). However, it fails to split tables into index nodes when we do the same with PDF documents (after loading them using PDFReader). Is there a potential way to solve this issue? Thanks in advance.
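For anyone who wants to try the Camelot suggestion from the comments, here is a minimal sketch, not a confirmed fix. It assumes the PDF is text-based rather than scanned and that Camelot is installed. `rows_to_markdown` and `extract_pdf_tables` are hypothetical helper names made up for this example. The idea is to pull the tables out of the PDF separately and turn each one into a Markdown string, so the tables can be indexed on their own instead of relying on the parser to detect them in the PDF text.

```python
from typing import List, Sequence


def rows_to_markdown(rows: Sequence[Sequence[object]]) -> str:
    """Render table rows as a Markdown table, treating the first row as the header."""
    if not rows:
        return ""

    def clean(cell: object) -> str:
        # Flatten line breaks and escape pipes so each cell stays in its column.
        return str(cell).replace("\n", " ").replace("|", "\\|").strip()

    header = [clean(c) for c in rows[0]]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    for row in rows[1:]:
        lines.append("| " + " | ".join(clean(c) for c in row) + " |")
    return "\n".join(lines)


def extract_pdf_tables(pdf_path: str, pages: str = "all") -> List[str]:
    """Hypothetical helper: extract every table Camelot finds as a Markdown string."""
    # Imported lazily so the Markdown helper above works without Camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages=pages)
    # Each Camelot table exposes its cells as a pandas DataFrame via `.df`.
    return [rows_to_markdown(t.df.values.tolist()) for t in tables]
```

Calling `extract_pdf_tables("report.pdf")` (the file name is a placeholder) would return one Markdown string per detected table. As noted in the comments, if the parser cannot find the tables, Camelot may struggle with the same PDF too, so it is worth checking the output on a page or two first.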