Hi, I'm using a fine-tuned version of Llama 2 with 13B parameters; the LLM is run by Ollama. For the response mode I'm using "refine", since the responses are much better than with "compact". My top_k is 3, and responses take about 30 seconds, most of which is spent in the LLM. Is there any way to improve that speed without impacting response quality? I know refine has to make multiple requests, but it just gives the best answers (from my testing). I have two RTX 4090s, though I think Ollama already uses both.
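For context, here is roughly what the setup looks like. This is a minimal sketch, not my exact code; the data path, model tag, and timeout are placeholders, and the imports assume a recent LlamaIndex version (older releases use the flat `llama_index` imports):

```python
# Minimal sketch of the current setup (placeholders, not the exact config).
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.ollama import Ollama

llm = Ollama(model="llama2:13b", request_timeout=120.0)  # fine-tuned 13B model served by Ollama

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine(
    llm=llm,
    response_mode="refine",   # refine: one sequential LLM call per retrieved chunk
    similarity_top_k=3,       # top_k = 3 -> up to 3 LLM calls per query
)
print(query_engine.query("..."))
```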
Hey! Thanks for the help so far. I'm now using vLLM with the same model, but the output is really messed up compared to Ollama. Is it possible that I have to adjust the system prompt or something? Sadly I didn't find any property for that.
I did adjust all the penalties, top_p, etc. to match Ollama. I found the "system_prompt" property, but it seems there isn't any option for placeholders.
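Here is roughly how I'm wiring up vLLM now. Again just a sketch, not my exact config: the model path and sampling values are placeholders, and the `[INST]` prompt wrapper is only a guess at what Ollama's Modelfile template might be adding that vLLM isn't:

```python
# Rough sketch of the vLLM setup (placeholders, not the exact config).
# The prompt wrapper is an assumption: Ollama applies its Modelfile TEMPLATE,
# while vLLM receives raw text, so the Llama-2 chat tags may be missing here.
from llama_index.llms.vllm import Vllm

def completion_to_prompt(completion: str) -> str:
    # Llama-2 chat formatting that the Ollama template would normally add.
    return (
        "<s>[INST] <<SYS>>\nYou are a helpful assistant.\n<</SYS>>\n\n"
        f"{completion} [/INST]"
    )

llm = Vllm(
    model="/path/to/finetuned-llama2-13b",
    tensor_parallel_size=2,        # spread across both RTX 4090s
    temperature=0.7,
    top_p=0.9,
    presence_penalty=0.0,
    frequency_penalty=0.0,
    completion_to_prompt=completion_to_prompt,
)
```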