Hi! Can you help me to choose a proper

At a glance

The community member is trying to ingest PDF files sent from the frontend using FastAPI. They are unsure about the proper class/loader to use and whether they need to manually read the text contents and create a list of documents. The comments suggest using libraries like unstructured or pdf-miner/pdf-plumber to load the PDF files directly into document objects, as the uploaded files may be in a format that requires them to be on disk to work with existing PDF loaders. The community members are discussing whether there is a loader that can handle the PDF files automatically or if they need to parse them manually.

ppikachu8887867

Hi! Can you help me to choose a proper class/loader, to ingest pdf file(s) sent from frontend?

from typing import List

from fastapi import File, UploadFile


async def upload_files(pdf_files: List[UploadFile] = File(...)):
    for pdf_file in pdf_files:
        # What loader do I need to use? Do I need to first read the text
        # contents myself and create a list of Documents?
        ...

2 comments

Logan M

I forget what format these uploaded files are in. To use an existing PDF loader from LlamaHub, they might have to be on disk 🤔

I would probably use unstructured or pdfminer/pdfplumber directly and load the results into Document objects.
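For the "on disk" case mentioned above, one option is to write the uploaded bytes to a temporary file and hand that path to a path-based loader. The following is a minimal sketch using only the standard library; the helper name `spool_pdf_to_disk` is hypothetical and not part of FastAPI or any loader library.

```python
import tempfile
from pathlib import Path


def spool_pdf_to_disk(content: bytes, suffix: str = ".pdf") -> Path:
    """Write uploaded bytes to a temporary file and return its path.

    The caller is responsible for deleting the file when done.
    """
    # delete=False keeps the file on disk after the handle is closed,
    # so a path-based loader can open it afterwards.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(content)
        return Path(tmp.name)
```

Inside the endpoint this would be called as `path = spool_pdf_to_disk(await pdf_file.read())`, followed by whichever loader accepts a file path, and then `path.unlink()` to clean up.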

ppikachu8887867

@Logan M I think the content itself is a byte array, because to read the actual text, I need to call:

content = await pdf_file.read()

or something like that (I don't remember exactly either).

What I mean is: do I have to parse them manually using a PDF reader library (e.g. pdfplumber, pypdf, etc.) and create a list of Documents myself, or is there a loader that can do it automatically?

Seems like the latter option is correct 🙂
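The manual route discussed in this thread could look roughly like the sketch below: parse the bytes in memory with pypdf, then turn each non-empty page into a record. The function names `extract_page_texts` and `build_records` are hypothetical, and plain dicts stand in for Document objects here, since the exact Document constructor depends on the library version in use.

```python
import io
from typing import Dict, List


def extract_page_texts(content: bytes) -> List[str]:
    """Return the extracted text of each page of an in-memory PDF."""
    # Imported here so the rest of the module loads without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(io.BytesIO(content))
    # extract_text() may yield nothing for scanned or image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def build_records(filename: str, page_texts: List[str]) -> List[Dict]:
    """Turn per-page text into plain dicts, skipping empty pages."""
    return [
        {"text": text, "metadata": {"file_name": filename, "page": i}}
        for i, text in enumerate(page_texts, start=1)
        if text.strip()
    ]
```

In the upload handler, `content = await pdf_file.read()` supplies the bytes, and `build_records(pdf_file.filename, extract_page_texts(content))` gives one record per page. Each record's `text` and `metadata` can then be passed to whatever Document class the index expects.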
