On LlamaCpp, which parameters affect inference and loading time? I'm using a VectorStoreIndex with an embedding model and chunk_size_limit=300, and my query engine is created like this: cur_index.as_query_engine(streaming=True, similarity_top_k=3)
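For context, the setup looks roughly like this (a minimal sketch of the older llama_index ServiceContext API; the model path, embed_model, and LlamaCPP values are placeholders, not my exact config):

```python
from llama_index import VectorStoreIndex, ServiceContext, SimpleDirectoryReader
from llama_index.llms import LlamaCPP

# Placeholder model path and settings; context_window maps to n_ctx in llama.cpp.
llm = LlamaCPP(
    model_path="./models/llama-2-7b.Q4_K_M.gguf",
    context_window=2048,
    max_new_tokens=256,
)

# "local" pulls a default HuggingFace embedding model; chunk_size_limit as in my setup.
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local",
    chunk_size_limit=300,
)

documents = SimpleDirectoryReader("./data").load_data()
cur_index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = cur_index.as_query_engine(streaming=True, similarity_top_k=3)
response = query_engine.query("What parameters affect inference time?")
response.print_response_stream()
```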
On the other hand, I thought n_ctx didn't affect speed :( So if I send 50 words with n_ctx=512, it will be faster than sending the same 50 words with n_ctx=2048? Right? Why?
Correct! This is because the model ends up processing more tokens, and attention time scales quadratically with input length. It's a symptom of how LLMs are designed right now.
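As a rough back-of-the-envelope illustration (this assumes the prompt actually fills the context window; real timings also depend on hardware, batching, and KV-cache handling):

```python
# Toy illustration of quadratic attention cost, not a benchmark.
# Assumes self-attention over n tokens costs roughly n**2 work.
def relative_attention_cost(n_tokens: int, baseline: int = 512) -> float:
    return (n_tokens ** 2) / (baseline ** 2)

for n_ctx in (512, 1024, 2048):
    print(f"n_ctx={n_ctx}: ~{relative_attention_cost(n_ctx):.0f}x the work of 512 tokens")
# 512 -> ~1x, 1024 -> ~4x, 2048 -> ~16x
```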