Updated 10 months ago

I am currently building a Llama index

I am currently building a LlamaIndex RAG pipeline that processes user requests over unstructured data, most of the time PDFs.
It all works fine. The only problem I have is that some PDFs aren't scanned properly, and because I process them
with OCR, I get gibberish output that confuses the LLM.

Is there any way in LlamaIndex to clean this up? I mean, the text extracted from those PDFs doesn't make sense at all. Is there any
kind of IngestionSanitizer? Or how should I do that?

5 comments

Provo

@Logan M do you have any idea?

Provo

@Logan M do you have any idea?

Logan M

I don't think llama-index has anything specific for this. You could send the text to an LLM and ask it to clean it up.
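
A minimal sketch of that idea, not a llama-index feature: the `complete` argument is a hypothetical stand-in for whatever function sends a prompt to your LLM and returns its reply (for privacy-sensitive data this could be a locally hosted model). The prompt wording and the `UNREADABLE` sentinel are assumptions for illustration.

```python
from typing import Callable, Optional

# Assumed prompt wording; adjust it for your model and documents.
CLEANUP_PROMPT = (
    "The following text was extracted from a scanned PDF with OCR and may "
    "contain recognition errors. Fix obvious OCR mistakes without adding or "
    "removing information. If the text is unreadable, reply with exactly "
    "UNREADABLE.\n\nText:\n{text}"
)


def clean_ocr_text(text: str, complete: Callable[[str], str]) -> Optional[str]:
    """Ask an LLM to repair OCR text.

    `complete` is a hypothetical callable: prompt in, model reply out.
    Returns the cleaned text, or None if the model flags it as unreadable.
    """
    reply = complete(CLEANUP_PROMPT.format(text=text)).strip()
    if reply == "UNREADABLE":
        return None
    return reply
```

Chunks that come back as `None` can be skipped before indexing, so the gibberish never reaches the retriever.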

Logan M

I would also try using llama-parse, if that's an option for you.

Provo

llama-parse is not an option, I'm afraid, as data privacy is extremely important... Yeah, I thought so. I will try asking the LLM whether the data makes sense. Thank you!
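
Before spending an LLM call on every page, a cheap local heuristic can flag the worst OCR output. The sketch below is only an illustration and not part of llama-index: the function name, the thresholds, and the vowel check are assumptions, and the check only suits languages written in Latin script.

```python
import re

WORD_RE = re.compile(r"[A-Za-z]+")
VOWELS = set("aeiouyAEIOUY")


def looks_like_gibberish(text, min_alpha_ratio=0.6, min_plausible_ratio=0.7):
    """Heuristically flag OCR output that is unlikely to be real text.

    Thresholds are illustrative guesses; tune them on your own documents.
    """
    stripped = "".join(text.split())  # drop all whitespace
    if not stripped:
        return True

    # Real prose is mostly letters; OCR noise is full of symbols and digits.
    alpha_ratio = sum(ch.isalpha() for ch in stripped) / len(stripped)
    if alpha_ratio < min_alpha_ratio:
        return True

    words = WORD_RE.findall(text)
    if not words:
        return True

    # A plausible word is not absurdly long and contains at least one vowel.
    plausible = sum(
        1 for w in words if len(w) <= 20 and any(c in VOWELS for c in w)
    )
    return plausible / len(words) < min_plausible_ratio
```

Pages flagged this way could be dropped, or passed to the LLM for a second opinion, before they are ingested.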