On LlamaCpp, what parameters affect the inference loading time?

On LlamaCpp, what parameters affect the inference loading time? I'm using a VectorStoreIndex with an embedding model with chunk_size_limit=300, and my query engine is created as follows: cur_index.as_query_engine(streaming=True, similarity_top_k=3)

# Imports for the predictor below: LLMPredictor comes from llama_index,
# while LlamaCpp and the streaming callback utilities come from langchain.
from llama_index import LLMPredictor
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm_predictor = LLMPredictor(LlamaCpp(
    model_path="./llms/guanaco-13B.ggmlv3.q5_1.bin",
    n_ctx=2048,          # context window size
    max_tokens=2048,     # maximum number of tokens to generate
    n_gpu_layers=32,     # layers offloaded to the GPU
    temperature=0,
    verbose=True,
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
))

I don't know why it is taking around 15 minutes to respond.

response = query_engine.query(f'### Human: {instruction}\n### Assistant: ')
for r in response.response_gen:
    print(r, end='')
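
For reference, the surrounding index and query engine are built roughly like this (a minimal sketch assuming the ServiceContext-era llama_index API and a local HuggingFace embedding model; the data path and variable names are illustrative, and llm_predictor is the predictor defined above):

# Minimal sketch (not the exact original code) of the index and streaming
# query engine described in the question.
from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader, LangchainEmbedding
from langchain.embeddings import HuggingFaceEmbeddings

embed_model = LangchainEmbedding(HuggingFaceEmbeddings())   # any local embedding model
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model=embed_model,
    chunk_size_limit=300,        # chunk size mentioned in the question
)

documents = SimpleDirectoryReader("./data").load_data()     # "./data" is a placeholder path
cur_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = cur_index.as_query_engine(streaming=True, similarity_top_k=3)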
10 comments
LlamaCpp is going to be slow in general, since we are constantly pushing the context window to the max 🤔

Maybe try lowering the chunk_size in the service context before creating the index, or lowering the top k to 1 or 2 (see the sketch below).
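
A minimal sketch of those two knobs, reusing the names from the question and the sketch above (illustrative values only):

# Hypothetical illustration of the suggestion: smaller chunks and a smaller
# top-k both shrink the amount of retrieved text stuffed into each prompt,
# so llama.cpp has fewer prompt tokens to process per query.
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    embed_model=embed_model,
    chunk_size_limit=150,    # smaller chunks than the original 300
)
cur_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = cur_index.as_query_engine(streaming=True, similarity_top_k=1)  # fewer retrieved nodes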
Hi @Logan M, thank you so much for the answer!

When I set a small max_tokens, the answers get cut off :(
On the other hand, I thought n_ctx didn't affect it. So if I send 50 words with n_ctx=512, it would be faster than if I send 50 words with n_ctx=2048? Right? Why?
Correct! This is because the model is processing more tokens, and time scales quadratically with input length. It's a symptom of how LLMs are designed right now
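
To make the quadratic point concrete, a rough back-of-the-envelope comparison (illustrative arithmetic only, not a benchmark):

# Self-attention cost grows roughly with the square of the number of tokens
# actually processed, so a prompt that grows from 512 to 2048 tokens costs
# on the order of (2048 / 512)^2 = 16x more attention work.
short_prompt_tokens = 512
long_prompt_tokens = 2048
relative_cost = (long_prompt_tokens / short_prompt_tokens) ** 2
print(relative_cost)  # 16.0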
OK, and what can I do to prevent cut-off answers?
Maybe it works for OpenAI, but not for Guanaco or Vicuna. I will try again.
Ah right, you are using llama cpp
Not sure for that one
Do you know of any LLMs that support Spanish prompts? Or is Spanish support more related to the embedding model than to the inference model?