On LlamaCpp, what parameters affect the inference loading time?

On LlamaCpp, what parameters affect the inference loading time? I'm using a VectorStoreIndex with an embedding model with chunk_size_limit=300, and my query engine is created as follows: cur_index.as_query_engine(streaming=True, similarity_top_k=3)

# Imports for the predictor below: LLMPredictor comes from llama_index,
# while LlamaCpp and the streaming callback utilities come from langchain.
from llama_index import LLMPredictor
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm_predictor = LLMPredictor(LlamaCpp(
    model_path="./llms/guanaco-13B.ggmlv3.q5_1.bin",
    n_ctx=2048,          # context window size
    max_tokens=2048,     # maximum number of tokens to generate
    n_gpu_layers=32,     # layers offloaded to the GPU
    temperature=0,
    verbose=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
))

I don't know why it is taking around 15 minutes to respond.

response = query_engine.query(f'### Human: {instruction}\n### Assistant: ')
for r in response.response_gen:
    print(r, end='')
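
For reference, the surrounding index and query engine are built roughly like this (a minimal sketch assuming the ServiceContext-era llama_index API and a local HuggingFace embedding model; the data path and variable names are illustrative, and llm_predictor is the predictor defined above):

# Minimal sketch (not the exact original code) of the index and streaming
# query engine described in the question.
from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader, LangchainEmbedding
from langchain.embeddings import HuggingFaceEmbeddings

embed_model = LangchainEmbedding(HuggingFaceEmbeddings())   # any local embedding model
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model=embed_model,
    chunk_size_limit=300,        # chunk size mentioned in the question
)

documents = SimpleDirectoryReader("./data").load_data()     # "./data" is a placeholder path
cur_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = cur_index.as_query_engine(streaming=True, similarity_top_k=3)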
10 comments
LlamaCpp is going to be slow in general, since we are constantly pushing the context window to the max 🤔

Maybe try lowering the chunk_size in the service context before creating the index, or lowering the top k to 1 or 2 (see the sketch below).
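
A minimal sketch of those two knobs, reusing the names from the question and the sketch above (illustrative values only):

# Hypothetical illustration of the suggestion: smaller chunks and a smaller
# top-k both shrink the amount of retrieved text stuffed into each prompt,
# so llama.cpp has fewer prompt tokens to process per query.
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model=embed_model,
    chunk_size_limit=150,    # smaller chunks than the original 300
)
cur_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = cur_index.as_query_engine(streaming=True, similarity_top_k=1)  # fewer retrieved nodes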
Hi @Logan M, thank you so much for the answer!

When I set a small max_tokens, the answers get cut off :(
On the other hand, I thought n_ctx didn't affect it. So if I send 50 words with n_ctx=512, it would be faster than if I send 50 words with n_ctx=2048? Right? Why?
Correct! This is because the model is processing more tokens, and time scales quadratically with input length. It's a symptom of how LLMs are designed right now
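
To make the quadratic point concrete, a rough back-of-the-envelope comparison (illustrative arithmetic only, not a benchmark):

# Self-attention cost grows roughly with the square of the number of tokens
# actually processed, so a prompt that grows from 512 to 2048 tokens costs
# on the order of (2048 / 512)^2 = 16x more attention work.
short_prompt_tokens = 512
long_prompt_tokens = 2048
relative_cost = (long_prompt_tokens / short_prompt_tokens) ** 2
print(relative_cost)  # 16.0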
OK, and what can I do to prevent cut-off answers?
Maybe it works for OpenAI, but not for Guanaco or Vicuna. I will try again.
Ah right, you are using llama cpp
Not sure for that one
Do you know of any LLMs that support Spanish prompts? Or is Spanish support more related to the embedding model than to the inference model?