what are your model_kwargs? your GPU? also in the tutorial, if you look at the bottom of the page, the total run time is around 3 minutes
model_kwargs={"n_gpu_layers": 1},
yeah this is my model_kwargs
what's your equivalent of this?
i believe that if you have verbose set to True, it should be in your terminal
right before your "answer"
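something like this is what I mean, just a rough sketch of the LlamaCPP setup from the tutorial (the model path is a placeholder, and I'm assuming the older llama_index-style imports the tutorial uses):
```python
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="./models/llama-2-13b-chat.Q4_0.gguf",  # placeholder, point this at your own gguf
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_gpu_layers": 1},  # how many layers get offloaded to the GPU
    verbose=True,  # prints llama.cpp load / prompt-eval / eval timings to the terminal
)
```
with verbose=True the timing block shows up in the terminal right above the generated answer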
oh, sorry for the dumb question then😂
your eval time is very long, but I think that can be influenced by how fast you are able to retrieve relevant info from your vectorDB
which embed model are you using? the same as the tutorial?
yeah almost the same, I use bge-large-en-v1.5
but it was still the same when I used exactly the same one as the tutorial
maybe you can try to change it?
this works well for me: HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
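in my setup it's plugged in roughly like this (just a sketch, `llm` being the LlamaCPP instance from before, and I'm assuming the langchain wrapper the tutorial uses):
```python
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import ServiceContext
from llama_index.embeddings import LangchainEmbedding

# wrap the sentence-transformers model so llama_index can use it for retrieval
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

# llm is the LlamaCPP instance from earlier
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
```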
but anyway i am not an expert, so maybe you can wait for someone else
but it might be worth giving this a shot
Yeah thank you so much, I will definitely give it a try
Changing the embedding did not work
I believe the problem is the retrieval time from the vector db
maybe changing the indexing might solve it
I guess it also depends on how big your document is and other things, but your performance wasn't substantially slower than the one in the tutorial, so I'd say something like that is expected, no?
my document is an around 800-page-long pdf. I suppose they used a much shorter one in the example, but still, I gotta make this faster to use it in my project🥲
maybe you can try changing the storage context as well
here I use chroma, but there are multiple ones
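roughly like this (a sketch, the path and collection name are placeholders, `documents` / `service_context` being whatever you already built):
```python
import chromadb
from llama_index import StorageContext, VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# persistent chroma collection as the vector store (path / name are placeholders)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_docs")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
)
```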
No, this made it even worse
Did you install llama-cpp to run on your GPU?
yes, actually it works pretty well without llama-index
I can ask random questions and it performs well
LlamaIndex packs the prompt though
Like if you send 4000 tokens to it, I'd expect it to be just as slow outside of llama index
What kind of index are you using? Just a vector index? Didn't adjust the chunk size or top k?
what are the optimal ones in this case?
Decreasing the chunk size might help (fewer tokens to evaluate). Although your tokens/s are a little low for running on GPU tbh. You'll notice @andreamaestri had much faster timings, but maybe that's a hardware difference between you two
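something along these lines, just a sketch, the 512 / top_k=2 values are only examples to play with:
```python
from llama_index import ServiceContext, VectorStoreIndex

# smaller chunks -> fewer tokens per retrieved node for llama.cpp to evaluate
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    chunk_size=512,    # default is 1024
    chunk_overlap=20,
)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# fewer retrieved chunks -> shorter prompt sent to the LLM
query_engine = index.as_query_engine(similarity_top_k=2)
```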
How many tokens/s would you expect?
I get around 30 tokens/s with a query engine. Maybe try setting the gpu-layers to -1
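i.e. just change that one kwarg in the LlamaCPP setup (same placeholder path as before):
```python
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="./models/llama-2-13b-chat.Q4_0.gguf",  # placeholder
    model_kwargs={"n_gpu_layers": -1},  # -1 = offload every layer llama.cpp can to the GPU
    verbose=True,
)
```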
Oh, then I might have a problem related to that, because I have an NVIDIA L4 24 GB GPU, which is pretty powerful.