Hello, I have been following this tutorial: https://gpt-index.readthedocs.io/en/latest/examples/llm/llama_2_llama_cpp.html. I have a problem: the query function takes an extremely long time (around 8-10 minutes). I know this is a common issue with llama.cpp, but llama.cpp works fine with plain prompting and answering; the problem starts with QA, indexing, embedding, and so on. I can share my code as well if needed. Any help is appreciated.
what are your model_kwargs? your GPU? also, if you look at the bottom of the tutorial page, the total run time is around 3 minutes
model_kwargs={"n_gpu_layers": 1},
yeah this is my model_kwargs
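For reference, a rough sketch of how those kwargs get passed to LlamaCPP in a tutorial-style setup (the model URL and the other values here are placeholders, not taken from this thread):

```python
from llama_index.llms import LlamaCPP

# Sketch only: model_url is a placeholder, swap in whichever GGUF model you actually use
llm = LlamaCPP(
    model_url="https://huggingface.co/your-org/your-model.gguf",  # placeholder
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    # forwarded to llama_cpp.Llama; n_gpu_layers controls how many layers are offloaded to the GPU
    model_kwargs={"n_gpu_layers": 1},
    # verbose=True makes llama.cpp print its load/eval timings in the terminal
    verbose=True,
)
```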
what's your equivalent of this?
[Attachment: Screenshot_2023-09-28_at_12.30.30.png]
how can I get this data?
i believe if you have verbose set to True, it should show up in your terminal
right before your "answer"
oh, sorry for the dumb question then😂
[Attachment: Screenshot_2023-09-28_at_13.32.23.png]
your eval time is very long, but I think that can be influenced by how fast you are able to retrieve relevant info from your vectorDB
which embed model are you using? the same as the tutorial?
yeah, almost the same, I use bge-large-en-v1.5
but it was still the same when I used exactly the same one as in the tutorial
maybe you can try to change it?
this works well for me: HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
but anyway I am not an expert, so maybe you can wait for someone else
but it might be worth giving this a shot
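If you want to try swapping the embed model, a minimal sketch of wiring it into a ServiceContext (assuming the 0.8.x-era llama_index API used in the tutorial; depending on your version the LangChain embedding may need the LangchainEmbedding wrapper shown here):

```python
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import ServiceContext
from llama_index.embeddings import LangchainEmbedding

# Wrap the LangChain embedding so LlamaIndex can use it
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

# `llm` is the LlamaCPP instance from the earlier sketch
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
```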
Yeah, thank you so much, I will definitely give it a try
nice, let me know!
Changing the embedding did not work
I believe the problem is the retrieval time from the vector DB
maybe changing the indexing might solve it
ah true makes sense
I guess it also depends on how big your document is and other things, but your performance wasn't substantially slower than the one in the tutorial, so I'd say something like that is expected, no?
my document is an ~800-page-long PDF. I suppose they used a much shorter one in the example, but I still have to make this faster to use it in a project🥲
maybe you can try changing the storage context as well
here I use Chroma, but there are multiple options (rough sketch below)
[Attachment: Screenshot_2023-09-28_at_13.33.17.png]
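Roughly what that Chroma setup looks like, sketched under the same 0.8.x-era API assumptions (the path, collection name, and data directory are placeholders):

```python
import chromadb
from llama_index import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores import ChromaVectorStore

# Placeholder path/collection; persisting avoids re-embedding the 800-page PDF on every run
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("docs")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

documents = SimpleDirectoryReader("./data").load_data()  # placeholder data dir
index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,  # from the earlier sketch
)
```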
No, this made it even worse
@Logan M any ideas
Did you install llama-cpp to run on your GPU?
yes, actually it works pretty well without llama-index
I can ask random questions and it performs well
LlamaIndex packs the prompt though
which slows things down
Like if you send 4000 tokens to it, I'd expect it to be just as slow outside of LlamaIndex
What kind of index are you using? Just a vector index? Didn't adjust the chunk size or top k?
yeah default ones
what are the optimal values in this case?
Decreasing the chunk size might help (fewer tokens to evaluate). Although your tokens/s are a little low for running on a GPU, tbh. You'll notice @andreamaestri had much faster timings, but maybe that's a hardware difference between you two
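A sketch of those two knobs, a smaller chunk size and a smaller top-k (the values are just examples to experiment with, not recommendations, and `llm`, `embed_model`, and `documents` come from the earlier sketches):

```python
from llama_index import ServiceContext, VectorStoreIndex

# Smaller chunks mean fewer tokens per retrieved node for llama.cpp to evaluate
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model=embed_model, chunk_size=512
)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# A lower top-k means fewer chunks packed into the prompt
query_engine = index.as_query_engine(similarity_top_k=2)
response = query_engine.query("your question here")  # placeholder query
```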
How many tokens/s would you expect?
I get around 30 tokens/s with a query engine. Maybe try setting n_gpu_layers to -1
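i.e. in the constructor sketch from earlier, just change the GPU-offload kwarg (this assumes your llama-cpp-python build was compiled with GPU support):

```python
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_url=model_url,  # same placeholder model as in the earlier sketch
    context_window=3900,
    # -1 asks llama.cpp to offload every layer to the GPU instead of just one
    model_kwargs={"n_gpu_layers": -1},
    verbose=True,
)
```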
Oh, then I might have a problem related to that, because I have an NVIDIA L4 24 GB GPU, which is pretty strong.