Hi, I was wondering if someone could assist me with a LlamaParse question. I am parsing various PDF documents to embed them into a vector database (I am using Milvus). The vector database needs to have defined fields. The problem I am facing is that each document (not necessarily English) gives me a different schema when parsed. For example, one document's output has a text field containing the text I could import, but another document's content comes back split across a text and a value field. I am sure this is a common issue. Does someone know of an existing thread I could review to understand the best way to deal with this? I am trying to build a small library that I can access for GPT queries but am stumped at this point.
Hi Logan, I appreciate your reply. I took two documents that I wanted to upload to Milvus and used LlamaParse to produce JSON versions of both. The schemas (fields) in the two outputs were slightly different: the core data was found in text in one document, and in text and value in the other. Because I want to import the JSON files into Milvus, this leaves me with an issue, as I created three Milvus data fields (ID, Embedding and Text) to accept the import from the JSON fields.
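For reference, this is roughly how I set up the three-field collection (a minimal sketch; the collection name, embedding dimension and max_length here are placeholders, not what matters):

```python
# Sketch of the three-field Milvus collection described above.
# Embedding dim 1536 and max_length 65535 are placeholder values.
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1536),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=65535),
]
schema = CollectionSchema(fields, description="Parsed document chunks")
collection = Collection(name="documents", schema=schema)
```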
My concern is that I need to import the data fields from the documents, and I think LlamaParse does a good job, so I want to use it. But I am unsure if there is some way to standardise the parse output when using LlamaParse. Does this make sense?
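To show what I mean by standardising, this is what I am effectively doing by hand at the moment (just a sketch: flatten_record is a hypothetical helper of mine, not a LlamaParse function, and "text" / "value" are simply the field names I see in my two parses):

```python
# Hypothetical helper to flatten either schema into one plain text string
# before embedding, so both documents end up in the same id/embedding/text shape.
def flatten_record(record: dict) -> str:
    parts = []
    if record.get("text"):
        parts.append(str(record["text"]))
    if record.get("value"):
        parts.append(str(record["value"]))
    return "\n".join(parts)

# Example items mimicking the two schemas I got back from the JSON output:
parsed_json_items = [
    {"text": "Revenue grew 12% year over year."},     # style of document 1
    {"text": "Total revenue", "value": "USD 4.2m"},    # style of document 2
]
normalised = [flatten_record(item) for item in parsed_json_items]
```

Is there a cleaner way to get LlamaParse itself to give me one consistent layout, rather than patching it up like this afterwards?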