
Updated 8 months ago

I am currently building a Llama index

I am currently building a LlamaIndex RAG pipeline that processes user requests over unstructured data, most of the time PDFs.
It all works fine. The only problem I have is that some PDFs aren't scanned properly, and because I process them
with OCR, I get gibberish output that confuses the LLM.

Is there any way in LlamaIndex to clean this up? The text extracted from those files doesn't make sense at all. Is there any
kind of IngestionSanitizer? Or how should I do that?
5 comments
@Logan M do you have any idea?
I don't think llama-index has anything specific for this. You could send the text to an LLM and ask it to clean it up.
I would also try using llama-parse if that's an option for you.
llama-parse is not an option, I'm afraid, as data privacy is extremely important. Yeah, I thought so. I will try asking the LLM whether the data makes sense. Thank you!
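
Since privacy rules out a hosted parser, one option is a cheap local pre-filter that flags chunks that look like OCR noise before they reach the LLM or the index. The snippet below is only a minimal sketch, not something llama-index provides: the function names (`is_plausible_token`, `gibberish_score`, `split_chunks`), the vowel and consonant-run rules, and the 0.4 threshold are all assumptions, and the heuristic presumes English-like text. It scores a chunk by the fraction of tokens that do not look like ordinary words or numbers.

```python
import re
import string

# All names and thresholds here are hypothetical; tune them on your own data.
VOWELS = set("aeiouy")
LONG_CONSONANT_RUN = re.compile(r"[^aeiouy]{6,}")


def is_plausible_token(token: str) -> bool:
    """Rough check that a token looks like an English word or a number."""
    core = token.strip(string.punctuation)
    # Tolerate hyphenated words and apostrophes inside a token.
    core = core.replace("-", "").replace("'", "").lower()
    if not core:
        return False
    if core.isdigit():
        return True
    if not core.isalpha():
        return False  # mixed letters/digits/symbols, typical of bad OCR
    if not any(ch in VOWELS for ch in core):
        return False
    return LONG_CONSONANT_RUN.search(core) is None


def gibberish_score(text: str) -> float:
    """Return a value in [0, 1]; higher means more likely OCR noise."""
    tokens = text.split()
    if not tokens:
        return 1.0  # treat empty chunks as unusable
    plausible = sum(1 for t in tokens if is_plausible_token(t))
    return 1.0 - plausible / len(tokens)


def split_chunks(chunks, threshold=0.4):
    """Separate chunks into (kept, rejected) using the assumed threshold."""
    kept, rejected = [], []
    for chunk in chunks:
        (rejected if gibberish_score(chunk) > threshold else kept).append(chunk)
    return kept, rejected
```

Rejected chunks could then be dropped, re-run through OCR with different settings, or passed to a locally hosted LLM for cleanup, so only the suspicious text incurs the extra cost.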