Hello everyone! I’m working on improving my RAG pipeline by extracting images from my PDF files. While I haven’t encountered significant challenges in the ingestion and indexing phases, I’m a bit uncertain when it comes to retrieval.
Currently, retrieval is handled through tool calls, which lets the model decide when it needs additional information to answer a user's query. I'm using GPT-4o via OpenAI's API, but I'm hitting a limitation: tool results can only contain text, not images. My goal is to pass along any images present in the retrieved chunks so the model can use them to improve its answers.
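For context, here's a minimal sketch of my current flow (the tool name, schema, and chunk text are illustrative, not my actual code):

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical retrieval tool; the real schema is more elaborate.
tools = [{
    "type": "function",
    "function": {
        "name": "search_documents",
        "description": "Retrieve relevant chunks from the indexed PDFs.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "Search query."},
            },
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Summarize the installation steps."}]

first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
assistant_msg = first.choices[0].message

if assistant_msg.tool_calls:
    messages.append(assistant_msg)  # keep the assistant turn with the tool call
    messages.append({
        "role": "tool",
        "tool_call_id": assistant_msg.tool_calls[0].id,
        # Plain text works fine here; the problem is that there is no
        # obvious place for the images belonging to the retrieved chunks.
        "content": "Step 1: ... Step 2: ...",
    })
    final = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(final.choices[0].message.content)
```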
What would be the best way to overcome OpenAI’s API constraints? Has anyone else faced a similar issue? If so, how did you resolve it?
I've also included below a simplified version of the API call I attempted, but it didn't work as expected.
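This is a simplified sketch, with placeholder ids and image data; the part the API rejects is the image content inside the `tool` message:

```python
import base64
from openai import OpenAI

client = OpenAI()

# Placeholder: an image extracted from the PDF during ingestion.
with open("figure_page3.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "What does the diagram on page 3 show?"},
        # ... assistant message carrying the tool call omitted here ...
        {
            "role": "tool",
            "tool_call_id": "call_abc123",  # placeholder
            "content": [
                {"type": "text", "text": "Retrieved chunk text..."},
                # This is what fails: tool messages only accept text parts,
                # so the image_url part below is rejected.
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        },
    ],
)
```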
I completely agree with you; I fail to see the logic behind it. Do you have any experience with the solution you proposed? I've considered it myself, but it doesn't quite convince me. I don't have concrete data to back this up, but my impression is that it increases hallucinations.