
Improving Retrieval in a RAG Pipeline with Image Extraction from PDF Files

Hello everyone! I’m working on improving my RAG pipeline by extracting images from my PDF files. While I haven’t encountered significant challenges in the ingestion and indexing phases, I’m a bit uncertain when it comes to retrieval.

Currently, retrieval is handled through tool calls, letting the model decide when it needs additional information to answer a user's query. I'm using GPT-4o via OpenAI's API, but since tool outputs can only contain text, not images, I'm facing a limitation. My goal is to pass along any images present in the retrieved chunks to improve response quality.

What would be the best way to overcome OpenAI’s API constraints? Has anyone else faced a similar issue? If so, how did you resolve it?

I've also attached an example of an API call I attempted, but it didn’t work as expected.
6 comments
Yeah, OpenAI has some annoying limitations on tool outputs: images can only come from user messages.

You can probably just change the tool output to "image retrieved" and then add a user message right after it with the image.
I hope they lift that constraint at some point, seems silly
I completely agree with you; I fail to see the logic behind it.
Do you have any experience with the solution you proposed? I have considered it myself, but it doesn't quite convince me. I lack concrete data to support my argument, yet I have the impression that it increases hallucinations.
I don't, but the workaround makes sense to me? Not sure why it would increase hallucinations.

Plain Text
{"role": "tool", "tool_call_id": "...", "content": "Retrieved the image."},
{"role": "user", "content": [{"type": "text", "text": "Here's the image the tool retrieved"}, {"type": "image_url", "image_url": {"url": "..."}}]}
something roughly like that
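Fleshing that out a bit, here's a minimal sketch of the workaround: after the model calls your retrieval tool, answer the tool call with a text-only stub, then follow it with a user message carrying the image. The `tool_call_id` value and the image URL below are placeholders, not anything from a real run.

```python
def build_image_followup(tool_call_id: str, image_url: str) -> list[dict]:
    """Build the two messages that stand in for an image-bearing tool result.

    OpenAI tool messages can only carry text, so the stub tool message
    satisfies the pending tool call, and the image rides along in a
    user message, which is allowed to contain image_url content parts.
    """
    return [
        {
            "role": "tool",
            "tool_call_id": tool_call_id,  # must match the id from the model's tool call
            "content": "Retrieved the image; it follows in the next message.",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Here's the image the tool retrieved:"},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        },
    ]

# Placeholder values for illustration only.
followup = build_image_followup("call_abc123", "https://example.com/figure1.png")
```

You'd append these two messages to the conversation and call the chat completions endpoint again; from the model's perspective the image simply arrives in the next user turn.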
For parsing PDFs, I recommend switching to ColPali-based models.