Yeah, like directly in the ServiceContext, right?
Hmm, interesting. With these settings, a query would retrieve two chunks (2 × 512 = 1024 tokens), the prompt template + query is probably ~200 tokens, and we need to leave room for 256 output tokens. That's ~1480 tokens total, which is well under 2048.
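Quick back-of-the-envelope version of that budget (variable names are just for illustration, and the 200-token prompt overhead is a rough guess):
```python
# Rough token budget under the settings being discussed (numbers are estimates)
similarity_top_k = 2      # chunks retrieved per query
chunk_size = 512          # tokens per chunk
prompt_overhead = 200     # prompt template + query, rough guess
num_output = 256          # room reserved for the LLM's response

total = similarity_top_k * chunk_size + prompt_overhead + num_output
print(total)  # 1480 -- comfortably under a 2048-token context window
```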
With these settings, I think it's only making one LLM call? CAMEL might really just be that slow
You can test it directly by doing something like this
pred, prompt = llm_predictor.predict("really long string asking to tell a joke or something idk")
And use a string that's like 1500 tokens lol and see how fast it is
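Roughly like this, as a minimal sketch (import paths and the exact `predict` signature vary by llama_index version; the tuple unpacking below matches the older API):
```python
import time

from langchain.llms import OpenAI
from llama_index import LLMPredictor, Prompt  # import paths may differ across versions

# Roughly ~1500 tokens of filler text (each repetition is a handful of tokens)
long_text = "Tell me a joke. " * 375

llm_predictor = LLMPredictor(llm=OpenAI(temperature=0))
prompt = Prompt("{text}")  # pass the filler straight through as the prompt template

start = time.time()
# Older llama_index versions return a (prediction, formatted_prompt) tuple;
# newer ones return just the prediction string.
pred, formatted_prompt = llm_predictor.predict(prompt, text=long_text)
print(f"single LLM call took {time.time() - start:.1f}s")
```
If that single call already takes most of the wall-clock time you're seeing, the slowness is the LLM call itself rather than anything CAMEL is adding on top.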