Sanadh'eL
Joined September 25, 2024
With LlamaCpp, which parameters affect model loading and inference time? I'm using a VectorStoreIndex with an embedding model and chunk_size_limit=300, and my query engine is created like this: cur_index.as_query_engine(streaming=True, similarity_top_k=3)

from llama_index import LLMPredictor
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm_predictor = LLMPredictor(LlamaCpp(
    model_path="./llms/guanaco-13B.ggmlv3.q5_1.bin",
    n_ctx=2048,        # context window size
    max_tokens=2048,   # maximum tokens to generate per response
    n_gpu_layers=32,   # transformer layers offloaded to the GPU
    temperature=0,
    verbose=True,      # prints llama.cpp timing stats per call
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
))
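
For completeness, the predictor is wired into the index roughly like this (a sketch assuming the legacy ServiceContext API; documents is a placeholder for whatever is being indexed):

from llama_index import ServiceContext, VectorStoreIndex

# Attach the LlamaCpp predictor and chunking settings to the index
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    chunk_size_limit=300,
)
cur_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = cur_index.as_query_engine(streaming=True, similarity_top_k=3)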

I don't know why it takes around 15 minutes to respond:

# Stream the response token by token as it is generated
response = query_engine.query(f'### Human: {instruction}\n### Assistant: ')
for r in response.response_gen:
    print(r, end='')
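
In case it helps: the LlamaCpp parameters that usually dominate speed are n_gpu_layers, n_batch, and n_threads. A sketch of a tuned configuration (the values are illustrative assumptions, not measured settings):

from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./llms/guanaco-13B.ggmlv3.q5_1.bin",
    n_ctx=2048,
    n_gpu_layers=32,  # more offloaded layers is faster, if VRAM allows
    n_batch=512,      # prompt tokens processed per batch
    n_threads=8,      # CPU threads for layers left on the CPU
    verbose=True,     # llama.cpp timing output helps locate the bottleneck
)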
Hello there! How can I use llama_index with GPU?
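
The usual route is to build llama-cpp-python with GPU support and then offload layers. A minimal sketch (assumes a CUDA GPU; the install command goes in your shell):

# Rebuild llama-cpp-python with cuBLAS before running this:
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./llms/guanaco-13B.ggmlv3.q5_1.bin",  # path reused from the question above
    n_gpu_layers=32,  # number of transformer layers to offload to the GPU
)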
Yes, a chat interface like the ChatGPT chatbot. Not for software development, just for academic purposes.
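
For a ChatGPT-style interface over an index, llama_index exposes a chat engine. A minimal sketch (index is a placeholder for an already-built VectorStoreIndex):

# Simple chat loop over an existing index (sketch; `index` is assumed built)
chat_engine = index.as_chat_engine()
while True:
    user_input = input("You: ")
    response = chat_engine.chat(user_input)
    print(f"Assistant: {response}")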