Sanadh'eL
Joined September 25, 2024
With LlamaCpp, which parameters affect model loading and inference time? I'm using a VectorStoreIndex with an embedding model and chunk_size_limit=300, and my query engine is created like this: cur_index.as_query_engine(streaming=True, similarity_top_k=3)

from llama_index import LLMPredictor
from langchain.llms import LlamaCpp
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

llm_predictor = LLMPredictor(LlamaCpp(
    model_path="./llms/guanaco-13B.ggmlv3.q5_1.bin",
    n_ctx=2048,        # context window size
    max_tokens=2048,   # maximum tokens to generate per response
    n_gpu_layers=32,   # transformer layers offloaded to the GPU
    temperature=0,
    verbose=True,      # prints llama.cpp timing stats per call
    callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
))
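
For completeness, the predictor is wired into the index roughly like this (a sketch assuming the legacy ServiceContext API; documents is a placeholder for whatever is being indexed):

from llama_index import ServiceContext, VectorStoreIndex

# Attach the LlamaCpp predictor and chunking settings to the index
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
    chunk_size_limit=300,
)
cur_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = cur_index.as_query_engine(streaming=True, similarity_top_k=3)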

I don't know why it takes around 15 minutes to respond:

# Stream the response token by token as it is generated
response = query_engine.query(f'### Human: {instruction}\n### Assistant: ')
for r in response.response_gen:
    print(r, end='')
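
In case it helps: the LlamaCpp parameters that usually dominate speed are n_gpu_layers, n_batch, and n_threads. A sketch of a tuned configuration (the values are illustrative assumptions, not measured settings):

from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./llms/guanaco-13B.ggmlv3.q5_1.bin",
    n_ctx=2048,
    n_gpu_layers=32,  # more offloaded layers is faster, if VRAM allows
    n_batch=512,      # prompt tokens processed per batch
    n_threads=8,      # CPU threads for layers left on the CPU
    verbose=True,     # llama.cpp timing output helps locate the bottleneck
)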
Hello there! How can I use llama_index with GPU?
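
The usual route is to build llama-cpp-python with GPU support and then offload layers. A minimal sketch (assumes a CUDA GPU; the install command goes in your shell):

# Rebuild llama-cpp-python with cuBLAS before running this:
#   CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="./llms/guanaco-13B.ggmlv3.q5_1.bin",  # path reused from the question above
    n_gpu_layers=32,  # number of transformer layers to offload to the GPU
)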
Yes, a chat interface like the ChatGPT chatbot. Not for software development, just for academic purposes.
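
For a ChatGPT-style interface over an index, llama_index exposes a chat engine. A minimal sketch (index is a placeholder for an already-built VectorStoreIndex):

# Simple chat loop over an existing index (sketch; `index` is assumed built)
chat_engine = index.as_chat_engine()
while True:
    user_input = input("You: ")
    response = chat_engine.chat(user_input)
    print(f"Assistant: {response}")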