I've been working on integrating LlamaIndex with JSON files extracted from PDFs. For context, each text element from the PDF has been chunked, assigned an 'id', and given a 'parentId' that points to its parent element, so the text structure is explicitly hierarchical.
However, I've run into problems with both approaches I've tried: loading the files with the LlamaHub JSONReader, building a VectorStoreIndex, and querying through a general query_engine; and querying the raw JSON directly with the JSONQueryEngine (a minimal sketch of the setup is below). In both cases LlamaIndex seems unable to interpret the 'id' and 'parentId' fields even when I provide the schema, and it also handles more open-ended queries, such as summarizing all the documents, poorly.
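For reference, here's a minimal sketch of both routes (simplified: the exact import paths differ between llama-index versions, data.json / schema.json are placeholder paths for my real file and its JSON Schema, and I'm leaving out the LLM/Settings configuration):

```python
import json

from llama_index.core import VectorStoreIndex
from llama_index.core.indices.struct_store import JSONQueryEngine
from llama_index.readers.json import JSONReader  # LlamaHub JSON loader (llama-index-readers-json)

# "data.json" / "schema.json" are placeholders for my extracted file and its JSON Schema.
documents = JSONReader().load_data("data.json")

# Route 1: vector index + general-purpose query engine.
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

# Route 2: JSONQueryEngine over the raw JSON value plus the schema I provide.
with open("data.json") as f:
    json_value = json.load(f)
with open("schema.json") as f:
    json_schema = json.load(f)

json_query_engine = JSONQueryEngine(json_value=json_value, json_schema=json_schema)
```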
Has anyone run into the same thing, or does anyone have suggestions on how to improve the integration? Thanks in advance! Here's a sample of the JSON structure I'm working with:
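A stripped-down stand-in with the same shape (the ids, the text, and the null parentId on the top-level element are placeholders, not my actual content):

```json
[
  {
    "id": "C.1",
    "parentId": null,
    "text": "Parent element, e.g. a section heading ..."
  },
  {
    "id": "C.2",
    "parentId": "C.1",
    "text": "Child chunk text extracted from the PDF ..."
  }
]
```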
There is an error when I query "what's the file about" with the JSONQueryEngine:

         82     def p_error(self, t):
    ---> 83         raise JsonPathParserError('Parse error at %s:%s near token %s (%s)'
         84                                   % (t.lineno, t.col, t.value, t.type))
         85

    JsonPathParserError: Parse error at 1:4 near token task (ID)
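For clarity, that's just a plain call against the engine from the sketch above:

```python
# json_query_engine is the JSONQueryEngine instance built in the sketch above.
response = json_query_engine.query("what's the file about")
```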
And when I ask the general query engine "What's the parent text of C.2", I consistently get a response along the lines of "The information is not provided."
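That one is the analogous call against the vector-index engine:

```python
# query_engine = index.as_query_engine() from the sketch above.
response = query_engine.query("What's the parent text of C.2")
print(response)  # -> "The information is not provided."
```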