
Updated 2 months ago

Summary

Hi everyone, I'm trying to load 6 PDFs and use DocumentSummaryIndex to create a summary for each one. The problem is that whether I use the PDF reader or SimpleDirectoryReader, each page of each PDF is loaded as a separate document. So I end up with 29 Document objects and therefore 29 summaries. However, I only want 6 summaries, one per PDF. Any suggestions?
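One general way to get one document per PDF, not necessarily the fix suggested in this thread, is to merge the page-level documents back together by source file. The sketch below is plain Python and assumes each page record carries its file name, page number, and text; the dict shape and the `merge_pages_by_file` helper are hypothetical and not part of LlamaIndex.

```python
from collections import defaultdict


def merge_pages_by_file(pages):
    """Merge page-level records into one text per source file.

    `pages` is a list of dicts such as
    {"file_name": "a.pdf", "page": 1, "text": "..."}.
    This shape is an assumption for illustration, not a LlamaIndex type.
    """
    grouped = defaultdict(list)
    for page in pages:
        grouped[page["file_name"]].append(page)

    merged = {}
    for file_name, file_pages in grouped.items():
        # Keep pages in reading order before joining their text.
        file_pages.sort(key=lambda p: p["page"])
        merged[file_name] = "\n\n".join(p["text"] for p in file_pages)
    return merged
```

With 29 page records from 6 files, the returned dict would have 6 entries, one per PDF, and each merged text could then be wrapped in a single document object before building the index.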
9 comments
Thanks Logan, but now there is this error:
[Attachment: image.png]
I have already installed pytesseract, as you can see.
Did you pass the correct file path though? Colab can be tricky; I usually find the folder in the file explorer on the left, right-click it, and copy the path.
The error says no docs are loaded
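A quick way to rule out a path problem before calling any reader is to check the folder and list the PDFs it contains. This is a minimal sketch using only the standard library; `list_pdfs` is a hypothetical helper name.

```python
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path.

    Raises FileNotFoundError if `folder` is not an existing directory,
    which is the usual cause of a "no docs loaded" style error.
    """
    path = Path(folder)
    if not path.is_dir():
        raise FileNotFoundError(f"Not a directory: {path}")
    # Match the extension case-insensitively (.pdf, .PDF, ...).
    return sorted(p for p in path.iterdir() if p.suffix.lower() == ".pdf")
```

In Colab, uploaded files usually live under /content, so a call might look like `list_pdfs("/content/my_pdfs")`, where `my_pdfs` is a placeholder folder name. An empty list or an exception means the reader would have nothing to load.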
Thanks Logan. You are right. The path is incorrect. I missed the /content part

But now it's still the same:
[Attachment: image.png]
Ah now I see the tesseract error, whoops

So installing pytesseract only installs a wrapper around the actual tesseract executable

You need to install tesseract

!apt install tesseract-ocr

https://stackoverflow.com/questions/51696446/tesseract-installation-in-google-colaboratory
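To confirm that the tesseract executable itself is available, and not just the Python wrapper, a small check with the standard library can help. This is a sketch; `find_executable` is a hypothetical helper name, and it only checks that a binary with that name is on the PATH.

```python
import shutil


def find_executable(name="tesseract"):
    """Return the full path of `name` if it is on the PATH, else None."""
    return shutil.which(name)


tesseract_path = find_executable()
if tesseract_path is None:
    print("tesseract executable not found; install it with apt first")
else:
    print(f"tesseract found at {tesseract_path}")
```

If this reports that tesseract is missing even though pytesseract imports fine, the apt install step above is the part that was skipped.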
Thanks Logan! It works now. One PDF is really just one Document object. The problem now is that it takes forever to load just one PDF. Is that because it's using OCR?
Yeah, OCR is not fast 😅 About 1-5 sec per item, and it applies OCR (the image reader) to each image in the PDF
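For a rough sense of scale, the 1-5 seconds per item quoted above can be turned into a back-of-the-envelope estimate. This is only arithmetic on that range, not a benchmark; `estimate_ocr_seconds` is a hypothetical helper, and the image count has to come from the PDF itself.

```python
def estimate_ocr_seconds(n_images, low=1.0, high=5.0):
    """Return a (min, max) time estimate in seconds for OCR on n_images.

    `low` and `high` are the per-image bounds quoted in the thread
    (about 1-5 seconds); real timings depend on image size and hardware.
    """
    if n_images < 0:
        raise ValueError("n_images must be non-negative")
    return (n_images * low, n_images * high)
```

For example, a PDF with 40 embedded images would be expected to take somewhere between 40 and 200 seconds under these assumptions.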