Find answers from the community

```python
# Imports assume the llama-index 0.10+ package layout
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Local LLM via Ollama, local embedding model on the GPU
Settings.llm = Ollama(model="llama2", request_timeout=30, temperature=0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", device="cuda")

# Load the documents and build an in-memory vector index
documents = SimpleDirectoryReader("/home/chepworth/PycharmProjects/cmar/RAG_Data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query with streaming enabled so tokens print as they arrive
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("How big is saturn?")
response.print_response_stream()
```
6 comments
Running LLMs locally is generally pretty slow. With Ollama on my M2 Mac, responses generally take a minute

You probably want to increase the request_timeout there too
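For example, a more generous timeout can be passed when constructing the Ollama client (the 300-second value below is just an arbitrary example, not a recommendation):

```python
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# A longer timeout for slow local generation; 300 seconds is an
# arbitrary example value.
Settings.llm = Ollama(model="llama2", request_timeout=300.0, temperature=0)
```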
Sometimes it's very fast and sometimes it's very slow
I'm on a decent machine
Depends on a) how much context there is on the input and b) how much the LLM decides to write
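If the input side is the bottleneck, one knob to try (a sketch that assumes the default retriever from the index built above) is lowering `similarity_top_k` so fewer chunks get packed into the prompt:

```python
# Retrieve fewer chunks per query so the local LLM has less prompt text
# to evaluate before the first token; 2 is an arbitrary example value.
query_engine = index.as_query_engine(similarity_top_k=2, streaming=True)
```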

I have an M2 Pro Max; it still takes about 30-60s sometimes
Does it depend on how much the LLM decides to write if you're streaming the response?
Well, the stream will last longer lol. But the time until the stream starts is the model evaluating the input text (which, if it's large, and it usually is, will take some time)
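A rough sketch of that split, reusing the `index` built above and timing the wait before the first streamed token separately from the full stream:

```python
import time

# Assumes `index` was built as in the snippet above.
query_engine = index.as_query_engine(streaming=True)

start = time.time()
response = query_engine.query("How big is saturn?")

first_token_at = None
for token in response.response_gen:
    # The wait before the first token is dominated by the model
    # evaluating the retrieved context; the rest is generation time.
    if first_token_at is None:
        first_token_at = time.time()
    print(token, end="", flush=True)

if first_token_at is not None:
    print(f"\ntime to first token: {first_token_at - start:.1f}s, "
          f"total: {time.time() - start:.1f}s")
```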