Hi, I'm using a fine-tuned version of llama2 with 13B parameters; the LLM is run by Ollama.
For the response mode I'm using "refine", as the responses are much better than with "compact".
My top_k is 3.
Responses take about 30 seconds, and most of that time is spent in the LLM.
Is there any way to improve that speed without impacting the quality of the responses? I know refine has to make many requests, but it just provides the best answers (from my testing).
I have two RTX 4090s, though I think Ollama already uses both.
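For reference, a minimal sketch of a setup like this, assuming a recent LlamaIndex with the default vector index and embedding model (the model tag, data path, and query below are placeholders):

```python
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.ollama import Ollama

# Point LlamaIndex at the local Ollama server running the fine-tuned model.
# "llama2-custom:13b" and "./data" are placeholder names.
Settings.llm = Ollama(model="llama2-custom:13b", request_timeout=120.0)

index = VectorStoreIndex.from_documents(SimpleDirectoryReader("./data").load_data())

# "refine" walks the retrieved chunks one by one, so similarity_top_k=3 means
# roughly one LLM call per chunk -- which is where most of the ~30 s per query goes.
query_engine = index.as_query_engine(
    response_mode="refine",
    similarity_top_k=3,
)

print(query_engine.query("What does the corpus say about X?"))
```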
I don't really think there's a way to improve the speed, besides using a different LLM hosting option (vLLM, text-gen, etc.)
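For example, a rough sketch of pointing LlamaIndex at a vLLM server, assuming vLLM is already serving the fine-tuned weights through its OpenAI-compatible API on localhost:8000 (the model path and key below are placeholders):

```python
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike

# Assumes vLLM was started with its OpenAI-compatible server, e.g.
#   python -m vllm.entrypoints.openai.api_server --model /path/to/finetuned-llama2-13b --tensor-parallel-size 2
# The model name below has to match whatever name vLLM is serving it under.
Settings.llm = OpenAILike(
    model="/path/to/finetuned-llama2-13b",
    api_base="http://localhost:8000/v1",
    api_key="unused",        # vLLM ignores the key unless started with --api-key
    is_chat_model=False,     # set to True if the served model applies a chat template
)
```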
Thank you for the response. I saw roughly a 20-30% improvement by using only one GPU (obviously), but I will definitely try other hosting options.
Hey! Thanks for the help so far.
I'm now using vLLM with the same model.
However, the output is really messed up now (compared to Ollama).
Is it possible that I have to adjust the system prompt or anything?
Sadly I didn't find any property for that.

I did adjust all the penalties, top_p, etc. to match Ollama.
I found the "system_prompt" property, but it seems there isn't any option for placeholders.
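One possible cause of the garbled output: Ollama applies the Llama-2 chat template from the model's Modelfile automatically, while a raw completion endpoint serves the prompt as-is. A minimal sketch of that wrapper, assuming the stock Llama-2 chat format (the strings are placeholders, and whether it can be plugged in depends on which hooks your LLM wrapper exposes):

```python
# Stock Llama-2 chat format; a raw completion endpoint does not add this for you.
LLAMA2_PROMPT = "<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n{query} [/INST]"

def completion_to_prompt(query: str, system_prompt: str = "You are a helpful assistant.") -> str:
    """Wrap a plain query in the Llama-2 chat template before it is sent to the model."""
    return LLAMA2_PROMPT.format(system_prompt=system_prompt, query=query)

print(completion_to_prompt("What does the corpus say about X?"))
```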
Got it working properly.