RAG with llama-cpp-python (Mistral-7B-OpenOrca) is slow despite 2x RTX A4000 GPUs

I have implemented RAG using llama-cpp-python with the Mistral-7B-OpenOrca model, but the response time is too high, even though the API is hosted on a server with 2 NVIDIA RTX A4000 GPUs. Can someone help me out?
Attachment: Screenshot_2023-12-04_182526.png
16 comments
that is really slow... are you sure it's actually using the GPU?
I would suggest setting verbose=True in LlamaCPP() to see if it's actually using the GPU. I am using Llama-2 locally and it took me a few attempts to make it use the GPU (indicated by BLAS = 1 in the verbose output)
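For reference, a minimal sketch of what that check looks like (assuming the pre-0.10 llama-index import path for LlamaCPP; the model path is a placeholder):

```python
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="./models/mistral-7b-openorca.Q4_K_M.gguf",  # placeholder path
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_gpu_layers": -1},  # offload all layers to the GPU
    verbose=True,  # prints llama.cpp load info; look for "BLAS = 1"
)
```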
@Logan M I am guessing it is using the GPU, because I am seeing BLAS=1
@Anurag Agrawal as you rightly said, I am seeing BLAS=1
Did you set n_gpu_layers?
Does it change if you try setting it to a specific number, e.g. 42?
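For context, the underlying llama-cpp-python parameter is n_gpu_layers; a rough sketch of setting it directly (the model path is a placeholder, -1 offloads every layer, and a specific count such as 42 offloads only part of the model):

```python
from llama_cpp import Llama

llm = Llama(
    model_path="./models/mistral-7b-openorca.Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=-1,  # or a specific count, e.g. 42, to partially offload
    verbose=True,
)
```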
I didn't, let me try
@Anurag Agrawal @Logan M the GPU is in use, as can be seen in the attachment
Attachment: rn_image_picker_lib_temp_56df9e0f-d920-43dc-845d-09a4b08ba9b4.jpg
@Logan M please help
I'm not a llama-cpp expert, so I'm not sure (tbh I've never had good luck with speed on llama-cpp)
What kind of documents do you have? How many?
Is there a reason for using Mistral? I am using Llama-2 and am able to get an answer in as little as 4 seconds, but I index my documents in advance. Are you indexing at query time?
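Indexing in advance means paying the embedding cost once and persisting the result to disk; a minimal sketch (pre-0.10 llama-index imports, placeholder directories):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Build the index once, ahead of time, and persist it to disk.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./storage")
```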
There are around 2041 documents.
No, I am loading an index from my local storage
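Loading a persisted index looks roughly like this (again assuming pre-0.10 llama-index imports; the directory and query are placeholders):

```python
from llama_index import StorageContext, load_index_from_storage

# Reload the previously persisted index instead of rebuilding it.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()
response = query_engine.query("What do the documents say about X?")  # placeholder query
```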
What kind of index are you using?
@Logan M a VectorStoreIndex