
Updated 3 months ago

Summary

Hi everyone, I'm trying to load 6 PDFs and use DocumentSummaryIndex to create a summary for each of them. The problem is that whether I use the PDF reader or SimpleDirectoryReader, each page of each PDF is loaded as a separate document, so I end up with 29 Document objects and therefore 29 summaries. However, I only want 6 summaries, one per PDF. Any suggestions?
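One workaround, separate from the loader swap suggested in the replies, is to merge the per-page documents back together before indexing. The sketch below is a minimal illustration, assuming each loaded page exposes a `.text` attribute and a `.metadata` dict with a `"file_name"` key; both are assumptions about the loader output, and `merge_pages_by_file` is a hypothetical helper, not a library function. Stand-in objects are used so it runs without any loader installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def merge_pages_by_file(page_docs, key="file_name"):
    """Group per-page documents by a metadata key and join their text.

    Returns a dict mapping each file name to the concatenated page text,
    keeping pages in the order they were loaded.
    """
    pages = defaultdict(list)
    for doc in page_docs:
        pages[doc.metadata[key]].append(doc.text)
    return {name: "\n\n".join(texts) for name, texts in pages.items()}


# Stand-in page objects (hypothetical data) in place of real loader output.
page_docs = [
    SimpleNamespace(text="a1", metadata={"file_name": "a.pdf"}),
    SimpleNamespace(text="a2", metadata={"file_name": "a.pdf"}),
    SimpleNamespace(text="b1", metadata={"file_name": "b.pdf"}),
]
merged = merge_pages_by_file(page_docs)
print(len(merged))  # one entry per source file
```

Each merged string could then be wrapped in a single document object per PDF before building the summary index, so that 6 files yield 6 summaries.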

9 comments

Logan M

Maybe try a different pdf loader?

https://llamahub.ai/l/file-flat_pdf

Dokmy

Thanks Logan, but now there is this error:

Attachment

Dokmy

I have already installed pytesseract, as you can see

Logan M

Did you pass the correct file path though? Colab can be tricky. I usually find the folder in the file explorer on the left, right-click it, and copy the path

Logan M

The error says no docs are loaded
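A quick way to rule out a path problem before calling any loader is to list the PDFs the path actually contains. This is a minimal sketch using only the standard library; `list_pdfs` is a hypothetical helper name, and the folder you pass is whatever path you copied from the Colab file explorer.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the sorted PDF paths in `folder`; raise if the folder is missing."""
    root = Path(folder)
    if not root.is_dir():
        raise FileNotFoundError(f"Not a directory: {root}")
    # Compare suffixes case-insensitively so "REPORT.PDF" is also found.
    return sorted(p for p in root.iterdir() if p.suffix.lower() == ".pdf")
```

If the call raises or returns an empty list, the loader will not find any documents either, which matches a "no docs are loaded" error.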

Dokmy

Thanks Logan. You are right, the path was incorrect. I missed the /content part

But now it's still the same:

Attachment

Logan M

Ah now I see the tesseract error, whoops

So installing pytesseract only installs a wrapper around the actual tesseract executable

You need to install tesseract

!apt install tesseract-ocr

https://stackoverflow.com/questions/51696446/tesseract-installation-in-google-colaboratory
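Since pytesseract is only a wrapper, it can help to confirm that the tesseract executable itself is on the PATH before running the loader. A minimal sketch with the standard library follows; `tesseract_available` is a hypothetical helper name.

```python
import shutil


def tesseract_available(binary="tesseract"):
    """Return True if the named executable can be found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    print("tesseract executable not found; install it, e.g. apt install tesseract-ocr")
```

If this prints the warning even after `pip install pytesseract`, the system package is still missing.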

Dokmy

Thanks Logan! It works now. One PDF is really just one Document object. The problem now is that it takes forever to load just one PDF. Is that because it's using OCR?

Logan M

Yeah, OCR is not fast 😅 It takes about 1-5 sec per item, and it's applying OCR (the image reader) to each image in the PDF
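To get a rough feel for the load time, the 1-5 sec per item figure can be multiplied by the number of images in a PDF. The sketch below is only back-of-the-envelope arithmetic; the image count of 40 is a made-up example, and `ocr_time_range` is a hypothetical helper.

```python
def ocr_time_range(num_images, low_sec=1.0, high_sec=5.0):
    """Return (min, max) estimated OCR seconds for `num_images` images."""
    return num_images * low_sec, num_images * high_sec


low, high = ocr_time_range(40)  # 40 images is an illustrative number
print(f"{low:.0f}-{high:.0f} s")  # prints "40-200 s"
```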
