
Updated 5 months ago

At a glance
Hello, I'm trying to run llama_index with the Llama 2 7B model. For now I have only a few PDF documents, each containing 1-2 pages of text. I'm running it on fairly powerful GPUs (50 GB each), but every simple question takes 30-80 seconds to answer. I was able to bring that closer to 30 seconds by reducing chunk_size to 128, whereas the bare Llama 2 model answered the same question in about 12 seconds. How can I speed up the query_engine?
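For context, a minimal sketch of the kind of setup described above, including the chunk_size reduction mentioned in the question. The directory name "docs" and the specific parameter values are assumptions for illustration, not details from the original post; the settings shown (a smaller similarity_top_k and the "compact" response mode) are the usual first knobs for reducing the number of LLM calls per query, which tends to dominate latency here.

```python
# Hypothetical sketch of the setup described in the question.
# Assumes the llama-index package is installed and an LLM/embedding
# model has been configured; "docs" is a placeholder directory.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, Settings

# The chunk_size reduction mentioned in the question.
Settings.chunk_size = 128

documents = SimpleDirectoryReader("docs").load_data()
index = VectorStoreIndex.from_documents(documents)

# Fewer retrieved chunks and a compact response mode mean fewer
# LLM calls per query, which is often the main source of latency.
query_engine = index.as_query_engine(
    similarity_top_k=2,
    response_mode="compact",
)
```

This is a configuration sketch rather than a complete runnable program, since it depends on a locally configured Llama 2 model and the PDFs in question.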