Updated 2 years ago

At a glance

The community member is looking to convert a PDF to text so they can create a Pandas data frame, similar to the example provided. One community member suggests using the SimpleDirectoryReader, which uses PyPDF under the hood, as a convenient way to read the PDF and pass the text to the evaporate program. Another community member asks how this approach differs from or is better than using dedicated PDF-to-text libraries like pdf2text, Unstructured, and PyPDF. They mention that they might prefer the SimpleDirectoryReader approach because it simplifies dependencies and they trust language models more than potentially hardcoded code for processing unstructured data like PDFs. The community member also clarifies that they want a verbatim PDF-to-text translation.

Also, I wanted to convert a PDF to text so I can create a pandas DataFrame like this: https://gpt-index.readthedocs.io/en/latest/examples/output_parsing/evaporate_program.html Does anyone know how to do this?
5 comments
Yeah, you can read the PDF in using SimpleDirectoryReader, then iterate over the documents and pass the string of each document to the evaporate program.
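A minimal sketch of the approach described above, assuming llama-index is installed with PDF support. The helper names `texts_from_documents` and `load_pdf_texts` are made up for illustration, and the import path for `SimpleDirectoryReader` depends on the installed version, so both are tried.

```python
def texts_from_documents(documents):
    """Collect the non-empty text string of each loaded document."""
    texts = []
    for doc in documents:
        # Assumption: each document object exposes its content as `.text`.
        text = getattr(doc, "text", "") or ""
        if text.strip():
            texts.append(text)
    return texts


def load_pdf_texts(input_dir):
    """Read every file in `input_dir` and return one string per document."""
    # Imported lazily; the module path differs between llama-index releases.
    try:
        from llama_index.core import SimpleDirectoryReader  # newer releases
    except ImportError:
        from llama_index import SimpleDirectoryReader  # older releases

    documents = SimpleDirectoryReader(input_dir=input_dir).load_data()
    return texts_from_documents(documents)
```

Each string returned by `load_pdf_texts("./pdfs")` (a hypothetical folder) can then be passed to the evaporate program as in the linked example.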
@Logan M how is that different from/better than using other (seemingly) dedicated libraries for this:

pdf2text
Unstructured
PyPDF
I think I might go for what you recommended (which I assume is in LlamaIndex), because it simplifies dependencies + I trust LLMs more than potentially hardcoded code written by humans to process unstructured data like PDFs.
What I want is a verbatim PDF -> text translation.
The simple directory reader uses pypdf under the hood to get the text; it's just convenient to use the directory reader.

Really, you can use any method you want, as long as you have chunks of text to pass to the evaporate program 👌
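For instance, calling pypdf directly might look roughly like this. It is only a sketch: `extract_pdf_pages`, `chunk_text`, and the `max_chars` default are hypothetical choices rather than anything from LlamaIndex, and how close the extracted text is to verbatim depends on the PDF itself.

```python
def extract_pdf_pages(path):
    """Return the extracted text of each page in the PDF at `path`."""
    # Imported lazily so the chunking helper below works without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() can come back empty for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def chunk_text(text, max_chars=2000):
    """Split `text` into consecutive chunks of at most `max_chars` characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```

Joining the pages with `"\n".join(extract_pdf_pages("report.pdf"))` (a hypothetical file) and passing the result through `chunk_text` gives chunks of text that can go to the evaporate program.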