what are your model_kwargs? your GPU? also in the tutorial, if you look at the bottom of the page, the total run time is around 3 minutes
model_kwargs={"n_gpu_layers": 1},
yeah this is my model_kwargs
what's your equivalent of this?
i believe that if you have verbose set to True, it should be in your terminal
right before your "answer"
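something like this is what I mean, just a rough sketch of the LlamaCPP setup from the tutorial (the model path is a placeholder, and I'm assuming the older llama_index-style imports the tutorial uses):
```python
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="./models/llama-2-13b-chat.Q4_0.gguf",  # placeholder, point this at your own gguf
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_gpu_layers": 1},  # how many layers get offloaded to the GPU
    verbose=True,  # prints llama.cpp load / prompt-eval / eval timings to the terminal
)
```
with verbose=True the timing block shows up in the terminal right above the generated answer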
oh, sorry for the dumb question then😂
your eval time is very long, but I think that can be influenced by how fast you are able to retrieve relevant info from your vectorDB
which embed model are you using? the same as the tutorial?
yeah almost the same, I use bge-large-en-v1.5
but it was still the same when I used exactly the same one as the tutorial
maybe you can try to change it?
this works well for me: HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
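in my setup it's plugged in roughly like this (just a sketch, `llm` being the LlamaCPP instance from before, and I'm assuming the langchain wrapper the tutorial uses):
```python
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from llama_index import ServiceContext
from llama_index.embeddings import LangchainEmbedding

# wrap the sentence-transformers model so llama_index can use it for retrieval
embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")
)

# llm is the LlamaCPP instance from earlier
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
```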
but anyway i am not an expert, so maybe you can wait for someone else
but it might be worth giving this a shot
Yeah thank you so much, I will definitely give it a try
Changing the embedding did not work
I believe the problem is the retrieval time from the vector db
maybe changing the indexing might solve it
I guess it also depends on how big your document is and other things, but your performance wasn't substantially slower than the one in the tutorial, so I'd say something like that is expected, no?
my document is an around 800-page-long pdf. I suppose they used a much shorter one in the example, but still, I gotta make this faster to use it in my project🥲
maybe you can try changing the storage context as well
here I use chroma, but there are multiple ones
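roughly like this (a sketch, the path and collection name are placeholders, `documents` / `service_context` being whatever you already built):
```python
import chromadb
from llama_index import StorageContext, VectorStoreIndex
from llama_index.vector_stores import ChromaVectorStore

# persistent chroma collection as the vector store (path / name are placeholders)
chroma_client = chromadb.PersistentClient(path="./chroma_db")
chroma_collection = chroma_client.get_or_create_collection("my_docs")

vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents,
    storage_context=storage_context,
    service_context=service_context,
)
```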
No, this made it even worse
Did you install llama-cpp to run on your GPU?
yes, actually it works pretty well without llama-index
I can ask random questions and it performs well
LlamaIndex packs the prompt though
Like if you send 4000 tokens to it, I'd expect it to be just as slow outside of llama index
What kind of index are you using? Just a vector index? Didn't adjust the chunk size or top k?
what are the optimal ones in this case?
Decreasing the chunk size might help (fewer tokens to evaluate). Although your tokens/s are a little low for running on GPU tbh. You'll notice @andreamaestri had much faster timings, but maybe that's a hardware difference between you two
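something along these lines, just a sketch, the 512 / top_k=2 values are only examples to play with:
```python
from llama_index import ServiceContext, VectorStoreIndex

# smaller chunks -> fewer tokens per retrieved node for llama.cpp to evaluate
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    chunk_size=512,    # default is 1024
    chunk_overlap=20,
)

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# fewer retrieved chunks -> shorter prompt sent to the LLM
query_engine = index.as_query_engine(similarity_top_k=2)
```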
How many tokens/s would you expect?
I get around 30 tokens/s with a query engine. Maybe try setting the gpu-layers to -1
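i.e. just change that one kwarg in the LlamaCPP setup (same placeholder path as before):
```python
from llama_index.llms import LlamaCPP

llm = LlamaCPP(
    model_path="./models/llama-2-13b-chat.Q4_0.gguf",  # placeholder
    model_kwargs={"n_gpu_layers": -1},  # -1 = offload every layer llama.cpp can to the GPU
    verbose=True,
)
```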
Oh, then I might have a problem related to that, because I have an NVIDIA L4 24 GB GPU, which is pretty powerful.