I have implemented RAG using llama-cpp-python with the Mistral-7B-OpenOrca model, but the response time is too high, even though the API is hosted on a server with two NVIDIA RTX A4000 GPUs. Can someone help me out?
I would suggest setting `verbose=True` in `LlamaCPP()` to see whether it is actually using the GPU. I am running Llama 2 locally, and it took me a few attempts to make it use the GPU (indicated by `BLAS = 1` in the verbose output).
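For reference, here is a minimal sketch of what GPU offload looks like with llama-cpp-python directly (the model path and parameter values are assumptions; adjust them for your setup):

```python
from llama_cpp import Llama

# verbose=True prints the backend report at load time; look for
# "BLAS = 1" (or a line about layers offloaded to the GPU) to
# confirm the GPU is actually being used.
llm = Llama(
    model_path="./mistral-7b-openorca.Q4_K_M.gguf",  # assumed path
    n_gpu_layers=-1,  # offload all layers to the GPU; 0 means CPU-only
    n_ctx=4096,       # context window; raise only as needed
    verbose=True,
)

response = llm("Q: What is retrieval-augmented generation? A:", max_tokens=128)
print(response["choices"][0]["text"])
```

Also note that `n_gpu_layers` is silently ignored unless llama-cpp-python was installed with CUDA support, e.g. `CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python` (the exact flag varies by version), so a CPU-only build is a common cause of exactly this symptom.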
Is there a reason for using Mistral? I am using Llama 2 and can get an answer in as little as 4 seconds, but I index my documents in advance. Are you indexing at query time?
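If you are rebuilding the index on every request, persisting it once and loading it at query time usually removes most of the latency. A rough sketch with LlamaIndex, assuming a recent version where these classes live under `llama_index.core` (the directory names are assumptions):

```python
import os
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./index_storage"  # assumed location for the saved index

if not os.path.exists(PERSIST_DIR):
    # One-time, offline step: embed and index the documents, then persist.
    documents = SimpleDirectoryReader("./data").load_data()  # assumed data dir
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # At query time: load the prebuilt index instead of re-embedding everything.
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()
print(query_engine.query("What does the document say about X?"))
```

With this split, the per-query cost is just retrieval plus generation, which is where the GPU offload from the other answer matters.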