Hi, I'm using a fine-tuned version of Llama 2 with 13B parameters; the LLM is run by Ollama. For the response mode I'm using "refine", since the responses are much better than with "compact". My top_k is 3, and responses take about 30 seconds, most of which is spent in the LLM. Is there any way to improve that speed without impacting response quality? I know refine has to make multiple requests, but it just gives the best answers (from my testing). I have two RTX 4090s, though I think Ollama already uses both.
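For context, here is roughly what the setup looks like. This is a minimal sketch, not my exact code; the data path, model tag, and timeout are placeholders, and the imports assume a recent LlamaIndex version (older releases use the flat `llama_index` imports):

```python
# Minimal sketch of the current setup (placeholders, not the exact config).
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama2:13b", request_timeout=120.0)  # fine-tuned 13B model served by Ollama

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(
    llm=llm,
    response_mode="refine",   # refine: one sequential LLM call per retrieved chunk
    similarity_top_k=3,       # top_k = 3 -> up to 3 LLM calls per query
)
print(query_engine.query("..."))
```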
Hey! Thanks for the help so far. I'm now using vLLM with the same model, but the output is really messed up compared to Ollama. Is it possible that I have to adjust the system prompt or something? Sadly I didn't find any property for that.
I did adjust all the penalties, top_p, etc. to match Ollama. I found the "system_prompt" property, but it seems there isn't any option for placeholders.
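Here is roughly how I'm wiring up vLLM now. Again just a sketch, not my exact config: the model path and sampling values are placeholders, and the `[INST]` prompt wrapper is only a guess at what Ollama's Modelfile template might be adding that vLLM isn't:

```python
# Rough sketch of the vLLM setup (placeholders, not the exact config).
# The prompt wrapper is an assumption: Ollama applies its Modelfile TEMPLATE,
# while vLLM receives raw text, so the Llama-2 chat tags may be missing here.
from llama_index.llms.vllm import Vllm

def completion_to_prompt(completion: str) -> str:
    # Llama-2 chat formatting that the Ollama template would normally add.
    return (
        "<s>[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
        f"{completion} [/INST]"
    )

llm = Vllm(
    model="/path/to/finetuned-llama2-13b",
    tensor_parallel_size=2,        # spread across both RTX 4090s
    temperature=0.7,
    top_p=0.9,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    completion_to_prompt=completion_to_prompt,
)
```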