Hello, I have a general kind of question

Hello, I have a general kind of question: when using local models, what do you guys think would be the best (free) choice of inference engine? LlamaIndex supports vLLM, Ollama, llama.cpp, and even Hugging Face Transformers, plus a lot of other integrations.
This is my setup:
  • 2 Nvidia Quadro P4000 GPUs, each with 8 GB of VRAM (16 GB total)
  • Intel Xeon @ 3.70 GHz
  • 32 GB of RAM
The model I'm trying to use is Mistral 7B-Instruct-v0.1.
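For context, here's roughly how I'm wiring it up at the moment; switching engines in LlamaIndex is more or less a one-line swap of the LLM class. The imports are a sketch and assume a recent llama-index where the integrations ship as separate packages, so the exact paths may differ:

from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
# from llama_index.llms.llama_cpp import LlamaCPP  # alternative local backend

# Ollama backend: talks to a locally running "ollama serve"
Settings.llm = Ollama(model="mistral", request_timeout=120.0)

# The llama.cpp backend would look roughly like this instead:
# Settings.llm = LlamaCPP(
#     model_path="path/to/mistral-7b-instruct-v0.1.gguf",  # hypothetical local GGUF file
#     model_kwargs={"n_gpu_layers": -1},  # offload all layers to the GPUs
# )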
10 comments
I think Ollama is pretty easy and clean to set up
Do you think it'll use both GPUs in parallel?
Not sure about parallel, but I think telling the code which GPU devices are available will help it use all of them together:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"
I'm not really sure what you mean. Since I'm using Ollama on Windows, I only use cmd to start the local server with Mistral, and I don't know where to put the code you just provided.
I'm using Mistral to do RAG over some PDFs btw, so I use LlamaIndex.
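The RAG part is just the standard LlamaIndex flow, something like this (the embedding model and the "./pdfs" folder are placeholders I picked, nothing special):

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# local embedding model so everything stays on the machine (model choice is just an example)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

documents = SimpleDirectoryReader("./pdfs").load_data()   # load the folder of PDFs
index = VectorStoreIndex.from_documents(documents)        # chunk, embed and index them

query_engine = index.as_query_engine()                    # uses Settings.llm (Mistral via Ollama)
print(query_engine.query("What are these documents about?"))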
Oh okay, on Windows it might be different. The code I provided tells the process that the current machine has 4 GPUs (adjust the IDs to match your setup).
So the idea is to let your LLM use all the available GPUs.
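Concretely, something like this; it's just a sketch, and the key point is where the variable gets set, because for Ollama it has to happen in the shell that starts the server, not inside your Python script:

import os

# For in-process engines (llama-cpp-python, transformers, vLLM) set this BEFORE
# anything initialises CUDA, otherwise it has no effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # your two P4000s

# Ollama runs as a separate server process, so set the variable in the cmd window
# that launches it instead, e.g.:
#   set CUDA_VISIBLE_DEVICES=0,1
#   ollama serve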
Oh alright, thanks ;)) I'll try it. However, do you know of any benchmark that compares the different inference engines? I tried vLLM on Linux and it didn't go so well, so I was wondering if there's a comparative benchmark of some sort for inference speed across the different engines.
Not really. If I come across any, I'll share it here!
Thanks a lot ;)) I'm trying to get the best inference speed without having to use quantized models, since quantization hurts the quality of the responses.
Yeah, quantization is great for speed, but it hampers the quality!