Updated 4 months ago

Hi! Can you help me to choose a proper

At a glance
A community member is trying to ingest PDF files uploaded from the frontend through FastAPI. They are unsure which class/loader to use and whether they need to read the text contents themselves and build a list of Documents. A reply suggests parsing the files directly with a library such as unstructured or pdfminer/pdfplumber and loading the text into Document objects, because the existing PDF loaders may only work with files on disk. The community members discuss whether any loader can handle the uploaded PDFs automatically or whether they have to be parsed manually.
Hi! Can you help me choose the proper class/loader to ingest PDF file(s) sent from the frontend?

Plain Text
from typing import List

from fastapi import File, UploadFile


async def upload_files(pdf_files: List[UploadFile] = File(...)):
    for pdf_file in pdf_files:
        # What loader do I need to use? Do I need to first read the text
        # contents myself and create a list of Documents?
        ...
2 comments
I forget what format these uploaded files are in. To use them with an existing PDF loader from llama-hub, they might have to be on disk 🤔

I would probably use unstructured or pdfminer/pdfplumber directly and load the text into Document objects
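
A rough sketch of the "parse it yourself" route, assuming pdfplumber is installed and the upload is already available as bytes. The helper names `extract_page_texts` and `pages_to_records` are made up for illustration and are not part of any library:

```python
import io
from typing import List


def extract_page_texts(pdf_bytes: bytes) -> List[str]:
    """Return the extracted text of each page of an in-memory PDF."""
    # Third-party dependency, imported here so the rest of the module
    # can be used even when pdfplumber is not installed.
    import pdfplumber

    # pdfplumber.open accepts a path or a binary file-like object.
    with pdfplumber.open(io.BytesIO(pdf_bytes)) as pdf:
        # extract_text() may give nothing for pages without a text layer.
        return [page.extract_text() or "" for page in pdf.pages]


def pages_to_records(page_texts: List[str], file_name: str) -> List[dict]:
    """Pair each non-empty page text with basic metadata (hypothetical helper)."""
    return [
        {"text": text, "metadata": {"file_name": file_name, "page": number}}
        for number, text in enumerate(page_texts, start=1)
        if text.strip()  # skip blank pages
    ]
```

Each record could then be wrapped in a llama_index `Document`, e.g. `Document(text=record["text"], metadata=record["metadata"])`. The metadata keyword has changed between llama_index releases (older ones used `extra_info`), so check the version you have installed.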
@Logan M I think the content itself is a byte array, because to read the actual text, I need to call:

Plain Text
content = await pdf_file.read()


or something like that (I don't remember exactly either)
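
For reference, FastAPI's `UploadFile.read()` is a coroutine that returns `bytes`, so there are two ways to hand the content to a PDF library. This is a minimal sketch using only the standard library, and the helper names `as_file_object` and `write_to_temp_pdf` are hypothetical:

```python
import io
import tempfile
from pathlib import Path


def as_file_object(content: bytes) -> io.BytesIO:
    """Wrap raw bytes in a file-like object for parsers that accept streams."""
    return io.BytesIO(content)


def write_to_temp_pdf(content: bytes) -> Path:
    """Write raw bytes to a temporary .pdf file for loaders that need a path.

    The file is not deleted automatically; the caller should remove it
    (e.g. with Path.unlink()) once the loader is done with it.
    """
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(content)
    return Path(tmp.name)
```

Inside the endpoint this would be used as `content = await pdf_file.read()` followed by one of the two helpers. The on-disk variant is the one to try if a loader only accepts file paths.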


What I mean is: do I have to parse them manually using a PDF reader library (e.g. pdfplumber, PyPDF, etc.) and create a list of Documents myself, or is there a loader that can do it automatically?

Seems like the latter option is correct 🙂