Hello, I have a general kind of question

Hello, I have a general kind of question: when using local models, what do you guys think would be the best (free) choice of inference engine? LlamaIndex supports vLLM, Ollama, llama.cpp, and even Hugging Face Transformers, plus a lot of other integrations.
This is my setup:
  • 2 Nvidia Quadro P4000 GPUs, each with 8 GB of VRAM (16 GB total)
  • Intel Xeon @ 3.70 GHz
  • 32 GB of RAM
The model I'm trying to use is Mistral 7B-Instruct-v0.1.
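For context, here's roughly how I'm wiring it up at the moment; switching engines in LlamaIndex is more or less a one-line swap of the LLM class. The imports are a sketch and assume a recent llama-index where the integrations ship as separate packages, so the exact paths may differ:

from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
# from llama_index.llms.llama_cpp import LlamaCPP  # alternative local backend

# Ollama backend: talks to a locally running "ollama serve"
Settings.llm = Ollama(model="mistral", request_timeout=120.0)

# The llama.cpp backend would look roughly like this instead:
# Settings.llm = LlamaCPP(
#     model_path="path/to/mistral-7b-instruct-v0.1.gguf",  # hypothetical local GGUF file
#     model_kwargs={"n_gpu_layers": -1},  # offload all layers to the GPUs
# )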
10 comments
I think Ollama is pretty easy and clean to set up
Do you think it'll use both GPUs in parallel?
Not sure about parallel, but I think telling the code which GPU devices are available will help it use all of them together:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
I'm not really sure what you mean. Since I'm using Ollama on Windows, I only use cmd to start the local server with Mistral, and I don't know where to put the code you just provided.
I'm using Mistral to do RAG over some PDFs btw, so I use LlamaIndex.
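The RAG part is just the standard LlamaIndex flow, something like this (the embedding model and the "./pdfs" folder are placeholders I picked, nothing special):

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# local embedding model so everything stays on the machine (model choice is just an example)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

documents = SimpleDirectoryReader("./pdfs").load_data()   # load the folder of PDFs
index = VectorStoreIndex.from_documents(documents)        # chunk, embed and index them

query_engine = index.as_query_engine()                    # uses Settings.llm (Mistral via Ollama)
print(query_engine.query("What are these documents about?"))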
Oh okay, on Windows it might be different. The code I provided tells the process that the current machine has 4 GPUs (adjust the IDs to match your setup).
So the idea is to let your LLM use all the available GPUs.
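Concretely, something like this; it's just a sketch, and the key point is where the variable gets set, because for Ollama it has to happen in the shell that starts the server, not inside your Python script:

import os

# For in-process engines (llama-cpp-python, transformers, vLLM) set this BEFORE
# anything initialises CUDA, otherwise it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # your two P4000s

# Ollama runs as a separate server process, so set the variable in the cmd window
# that launches it instead, e.g.:
#   set CUDA_VISIBLE_DEVICES=0,1
#   ollama serve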
Oh alright, thanks ;)) I'll try it. However, do you know of any benchmark that compares the different inference engines? I tried vLLM on Linux and it didn't go so well, so I was wondering if there's a comparative benchmark of some sort for inference speed across the different engines.
Not really. If I come across any, I'll share it here!
Thanks a lot ;)) I'm trying to get the best inference speed without having to use quantized models, since quantization hurts the quality of the responses.
Yeah, quantization is great for speed, but it hampers the quality!