Find answers from the community

```python
# Imports assume the llama-index 0.10+ package layout
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Local LLM via Ollama, local embedding model on the GPU
Settings.llm = Ollama(model="llama2", request_timeout=30, temperature=0)
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5", device="cuda")

# Load the documents and build an in-memory vector index
documents = SimpleDirectoryReader("/home/chepworth/PycharmProjects/cmar/RAG_Data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Query with streaming enabled so tokens print as they arrive
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("How big is saturn?")
response.print_response_stream()
```
6 comments
Running LLMs locally is generally pretty slow. With Ollama on my M2 Mac, responses generally take a minute

You probably want to increase the request_timeout there too
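For example, a more generous timeout can be passed when constructing the Ollama client (the 300-second value below is just an arbitrary example, not a recommendation):

```python
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama

# A longer timeout for slow local generation; 300 seconds is an
# arbitrary example value.
Settings.llm = Ollama(model="llama2", request_timeout=300.0, temperature=0)
```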
Sometimes it's very fast and sometimes it's very slow
I'm on a decent machine
Depends on a) how much context there is on the input and b) how much the LLM decides to write
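If the input side is the bottleneck, one knob to try (a sketch that assumes the default retriever from the index built above) is lowering `similarity_top_k` so fewer chunks get packed into the prompt:

```python
# Retrieve fewer chunks per query so the local LLM has less prompt text
# to evaluate before the first token; 2 is an arbitrary example value.
query_engine = index.as_query_engine(similarity_top_k=2, streaming=True)
```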

I have an M2 Pro Max; it still takes about 30-60s sometimes
Does it depend on how much the LLM decides to write if you're streaming the response?
Well, the stream will last longer lol. But the time until the stream starts is the model evaluating the input text (which, if it's large, and it usually is, will take some time)
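A rough sketch of that split, reusing the `index` built above and timing the wait before the first streamed token separately from the full stream:

```python
import time

# Assumes `index` was built as in the snippet above.
query_engine = index.as_query_engine(streaming=True)

start = time.time()
response = query_engine.query("How big is saturn?")

first_token_at = None
for token in response.response_gen:
    # The wait before the first token is dominated by the model
    # evaluating the retrieved context; the rest is generation time.
    if first_token_at is None:
        first_token_at = time.time()
    print(token, end="", flush=True)

if first_token_at is not None:
    print(f"\ntime to first token: {first_token_at - start:.1f}s, "
          f"total: {time.time() - start:.1f}s")
```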