I have implemented RAG using llama-cpp-python with the Mistral-7B-OpenOrca model, but the response time is too high, even though the API is hosted on a server with two NVIDIA RTX A4000 GPUs. Can someone help me out?
I would suggest setting `verbose=True` in `LlamaCPP()` to see whether it is actually using the GPU. I am running Llama 2 locally, and it took me a few attempts to make it use the GPU (indicated by `BLAS = 1` in the verbose output).
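For reference, here is a minimal sketch of what GPU offload looks like with llama-cpp-python directly (the model path and parameter values are assumptions; adjust them for your setup):

```python
from llama_cpp import Llama

# verbose=True prints the backend report at load time; look for
# "BLAS = 1" (or a line about layers offloaded to the GPU) to
# confirm the GPU is actually being used.
llm = Llama(
    model_path="./mistral-7b-openorca.Q4_K_M.gguf",  # assumed path
    n_gpu_layers=-1,  # offload all layers to the GPU; 0 means CPU-only
    n_ctx=4096,       # context window; raise only as needed
    verbose=True,
)

response = llm("Q: What is retrieval-augmented generation? A:", max_tokens=128)
print(response["choices"][0]["text"])
```

Also note that `n_gpu_layers` is silently ignored unless llama-cpp-python was installed with CUDA support, e.g. `CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python` (the exact flag varies by version), so a CPU-only build is a common cause of exactly this symptom.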
Is there a reason for using Mistral? I am using Llama 2 and can get an answer in as little as 4 seconds, but I index my documents in advance. Are you indexing at query time?
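If you are rebuilding the index on every request, persisting it once and loading it at query time usually removes most of the latency. A rough sketch with LlamaIndex, assuming a recent version where these classes live under `llama_index.core` (the directory names are assumptions):

```python
import os
from llama_index.core import (
    SimpleDirectoryReader,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)

PERSIST_DIR = "./index_storage"  # assumed location for the saved index

if not os.path.exists(PERSIST_DIR):
    # One-time, offline step: embed and index the documents, then persist.
    documents = SimpleDirectoryReader("./data").load_data()  # assumed data dir
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # At query time: load the prebuilt index instead of re-embedding everything.
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context)

query_engine = index.as_query_engine()
print(query_engine.query("What does the document say about X?"))
```

With this split, the per-query cost is just retrieval plus generation, which is where the GPU offload from the other answer matters.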