I am currently building a LlamaIndex RAG pipeline that processes user requests over unstructured data, most of the time PDFs. It all works fine. The only problem I have is that some PDFs aren't scanned properly, and because I process them with OCR, I get gibberish output which confuses the LLM.
Is there any way in LlamaIndex to clean those up? I mean, the text extracted from them doesn't make sense at all. Is there any kind of IngestionSanitizer? Or how should I do that?
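One option that keeps everything local is a cheap heuristic check on the OCR text before it gets indexed. The following is only a rough sketch in plain Python, not a LlamaIndex feature: the names `gibberish_score` and `looks_like_gibberish`, the per-token rules, and the 0.4 threshold are all assumptions for illustration, and the vowel rule assumes mostly English text.

```python
VOWELS = set("aeiouAEIOU")  # assumption: documents are mostly English


def gibberish_score(text: str) -> float:
    """Return the fraction of whitespace-separated tokens that look like OCR noise."""
    tokens = text.split()
    if not tokens:
        return 1.0  # empty text carries no usable content

    bad = 0
    for tok in tokens:
        letters = [c for c in tok if c.isalpha()]
        # Share of characters that are letters or digits (punctuation lowers it).
        alnum_ratio = sum(c.isalnum() for c in tok) / len(tok)
        mostly_symbols = alnum_ratio < 0.6
        # Longer letter runs without any vowel are rarely real English words.
        no_vowel = len(letters) > 3 and not any(c in VOWELS for c in letters)
        # Very long tokens usually come from lost whitespace or noise.
        too_long = len(tok) > 25
        if mostly_symbols or no_vowel or too_long:
            bad += 1
    return bad / len(tokens)


def looks_like_gibberish(text: str, threshold: float = 0.4) -> bool:
    """Flag text whose share of suspicious tokens reaches the (hypothetical) threshold."""
    return gibberish_score(text) >= threshold
```

Pages or chunks that get flagged could be dropped, or routed to a second OCR pass, before they reach the index. The threshold would need tuning on a few real documents, since tables, part numbers, and non-English text can trip rules like these.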
LlamaParse is not an option, I'm afraid, as data privacy is extremely important... Yeah, I thought so. I will try asking the LLM whether the data makes sense. Thank you!
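For the "ask the LLM" idea, here is a minimal sketch of what such a check might look like. It assumes a hypothetical `complete` callable that takes a prompt string and returns the model's reply as a string; given the privacy constraint, that would presumably wrap a locally hosted model. The prompt wording, the function name `text_is_readable`, and the 1500-character sample size are all made up for illustration.

```python
from typing import Callable

# Hypothetical prompt; the wording is an assumption and likely needs tuning.
PROMPT_TEMPLATE = (
    "You are checking OCR output. Answer with exactly YES or NO.\n"
    "Is the following text readable, meaningful language?\n\n"
    "Text:\n{sample}\n"
)


def text_is_readable(
    text: str,
    complete: Callable[[str], str],
    max_chars: int = 1500,
) -> bool:
    """Ask an LLM whether a sample of the text is readable.

    `complete` is any function mapping a prompt string to the model's reply.
    """
    # Only send the beginning of the text to keep the check cheap.
    sample = text[:max_chars]
    answer = complete(PROMPT_TEMPLATE.format(sample=sample))
    # Treat anything that does not start with YES as "not readable".
    return answer.strip().upper().startswith("YES")
```

With a LlamaIndex LLM object, `complete` could probably be something like `lambda p: llm.complete(p).text`, but that depends on the LLM class in use. Since this costs one model call per chunk, it may make sense to run it only on chunks that a cheaper check already considers suspicious.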