I've "fixed" this, by implementing a TextClear to the ingestion pipeline to remove all non-text charcaters: https://docs.llamaindex.ai/en/stable/module_guides/loading/ingestion_pipeline/transformations.html. Now however, it's returning a lot of useless entries (ie ones that from a PDF page with just an image and page number, so the excerpt is just a number.
Is this just an issue with how the PDF is being read?
It seems like Gemini just didn't like the unicode characters for some reason. Claude/OpenAI were able to read around them, but Gemini seemed to short-circuit. In addition, it seemed to prioritize short entries for some reason (so it was pulling back four-character page-number entries). I had to clear out both of those (the unicode characters and the short entries) before it started pulling back relevant chunks. Even then it was doing a relatively poor job, so I just went back to Claude/OpenAI embeddings. 🙂
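A second transform in the same pipeline can drop those near-empty chunks; ShortNodeFilter and the 50-character cutoff below are hypothetical illustrations, not the poster's actual code:

```python
from llama_index.core.schema import TransformComponent


class ShortNodeFilter(TransformComponent):
    """Drop chunks too short to be useful, e.g. a bare page number."""

    min_chars: int = 50  # arbitrary threshold; tune for your corpus

    def __call__(self, nodes, **kwargs):
        return [n for n in nodes if len(n.text.strip()) >= self.min_chars]
```

Appending this after the cleaner in the pipeline's transformations list filters the chunks before they're embedded.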
Also, Gemini was "soft"-limiting me to around 16,000 tokens (even on Gemini 1.5 Pro, which I was very excited to get access to in the API), which is annoying since it advertises a 1 million token context window. It just returns a 500 server error if I send anything over 16,000 tokens. I suspect it's still in testing.
Also @Logan M, to follow up: the Gemini API seems to be accepting system prompts now, which is great.