Yea, there could definitely be a llama-cpp server integration. Otherwise, you'd have to wrap your server API in a custom LLM class.
Ok, so I'll need to implement the complete method, which will communicate with the llama-cpp server and return the text from the server response, right?
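Right -- a minimal sketch, assuming llama.cpp's built-in HTTP server (which exposes POST /completion and returns JSON with a "content" field); the class name and default base_url below are invented:
from typing import Any

import requests
from llama_index.llms import CompletionResponse, CompletionResponseGen, CustomLLM, LLMMetadata
from llama_index.llms.base import llm_completion_callback

class LlamaCppServerLLM(CustomLLM):
    base_url: str = "http://localhost:8080"  # wherever your llama-cpp server listens

    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(context_window=4096, num_output=256, model_name="llama-cpp-server")

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # send the prompt to the server and pull the generated text out of the response
        resp = requests.post(f"{self.base_url}/completion", json={"prompt": prompt})
        resp.raise_for_status()
        return CompletionResponse(text=resp.json()["content"])

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        raise NotImplementedError("streaming not wired up in this sketch")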
Hey @Logan M, would it work if I change openai.api_base and point it at the URL of my self-hosted API, and use llama-index as-is?
@Borg1903 definitely! As long as the API interface is the same.
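Something like this, with the pre-1.0 openai module (the URL is a placeholder for your self-hosted endpoint):
import openai

# point the client at a self-hosted, OpenAI-compatible server
openai.api_base = "http://<your-host>:8000/v1"
openai.api_key = "dummy"  # the client insists on a key even if your server ignores it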
I'm looking to do the same thing (use a privately hosted llama model exposed via a network endpoint) with either Ollama or the llama.cpp server.
I'm running into two issues:
- If I supply a (fake) OPENAI_API_KEY, I get an authentication error. That shouldn't happen, because my LLM is defined by llm = Ollama(model="llama2").
- If I omit the OPENAI_API_KEY, it tries to run the model locally.
Any suggestions?
Can you share more of your setup?
Sure. I think the key parts are:
llm = Ollama(model="llama2")  # need to make sure Ollama is listening on localhost:11434
embeddings = OllamaEmbeddings(base_url="[URL here]:11434", model="llama2")
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embeddings)
agent = ReActAgent.from_tools(query_engine_tools, llm=function_llm, verbose=True)
Basically following the Recursive Retriever + Document Agents sample but with Ollama.
I think you also need to pass the service context to the underlying query engines 🤔 (also, not sure what function_llm is here)
Yea, could be VectorStoreIndex.from_documents(..., service_context=service_context)
Will give it a shot now. Thanks.
So it's still complaining about no OPENAI_API_KEY, but it does fall back to using HuggingFaceBgeEmbeddings with model_name=BAAI/bge-small-en.
I would expect it to make a call to Ollama for embeddings.
Then it fails with a CUDA error (I've not given it access to any GPUs, as it should be leveraging the LLM via the network).
Here's what I have:
# build vector index
vector_index = VectorStoreIndex.from_documents(
    product_docs[product_model], service_context=service_context
)
# build summary index
summary_index = SummaryIndex.from_documents(
    product_docs[product_model], service_context=service_context
)
# define query engines
vector_query_engine = vector_index.as_query_engine(service_context=service_context)
list_query_engine = summary_index.as_query_engine(service_context=service_context)
and that service context has embed_model=... in it?
(also FYI, those BGE embeddings are going to be much better than Ollama's, but that's beside the point for now)
Yes
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embeddings)
Ok, just to make our lives easier (since something must be getting missed somewhere), you can also set a global service context:
I agree that the BGE embeddings would likely be better. Just trying to prove the model at the moment.
from llama_index import set_global_service_context
set_global_service_context(service_context)
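With that set, the explicit service_context kwargs in the snippets above should become optional, e.g.:
# the global service context is picked up as the default
vector_index = VectorStoreIndex.from_documents(product_docs[product_model])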
I set the global service context but no change in error.
Just to be explicit - here's the error:
******
Could not load OpenAIEmbedding. Using HuggingFaceBgeEmbeddings with model_name=BAAI/bge-small-en. If you intended to use OpenAI, please check your OPENAI_API_KEY.
Original error:
No API key found for OpenAI.
Please set either the OPENAI_API_KEY environment variable or openai.api_key prior to initialization.
API keys can be found or created at https://platform.openai.com/account/api-keys
******
******
Could not load OpenAI model. Using default LlamaCPP=llama2-13b-chat. If you intended to use OpenAI, please check your OPENAI_API_KEY.
Original error:
No API key found for OpenAI.
Please set either the OPENAI_API_KEY environment variable or openai.api_key prior to initialization.
API keys can be found or created at https://platform.openai.com/account/api-keys
******
CUDA error 35 at /tmp/pip-install-yfh1hz6h/llama-cpp-python_37ac4eecf55744769bddf959a0139175/vendor/llama.cpp/ggml-cuda.cu:5509: CUDA driver version is insufficient for CUDA runtime version
current device: 0
Any way you can put the code into a single reproducible script?
So far it's just been snippets; it would be helpful to be able to reproduce it locally.
Just provided the code privately.
Of course, it assumes you have an Ollama server running as the endpoint.
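For anyone following along, the moving parts assembled into one script look roughly like this (endpoint and data path are placeholders, and the recursive-retriever/agent pieces are omitted):
from llama_index import (
    ServiceContext,
    SimpleDirectoryReader,
    VectorStoreIndex,
    set_global_service_context,
)
from llama_index.llms import Ollama
from langchain.embeddings import OllamaEmbeddings

llm = Ollama(model="llama2")  # base_url not configurable here -- see below
embeddings = OllamaEmbeddings(base_url="http://<ip>:11434", model="llama2")

service_context = ServiceContext.from_defaults(llm=llm, embed_model=embeddings)
set_global_service_context(service_context)

documents = SimpleDirectoryReader("./data").load_data()
vector_index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = vector_index.as_query_engine(service_context=service_context)
print(query_engine.query("What is this document about?"))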
for sure -- I will swap it out for something else, just to test. Thanks!
I think the issue is related to this block
query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever,
    response_synthesizer=response_synthesizer,
    service_context=service_context,
    base_url="<ip>:11434",
)
There's no way to assign a new base_url to the RetrieverQueryEngine, so it makes a call to localhost:11434 and that fails. There's no way to define a base_url in
llm = Ollama(model="llama2")
either.
Note that even though I passed base_url, it's not utilized.
Base url is already set on the Ollama LLM though
(sorry, haven't had a chance to test your code yet haha)
hmmm, you can't change that though; it's hard-coded to localhost:11434
calling:
llm = Ollama(base_url="<ip>", model="llama2")
throws an error
The contributor that added Ollama pointed out that the URL is hard-coded in Ollama itself as well
The port is. But I can run Ollama on any server anywhere; it listens on 0.0.0.0, so as long as I point my client at that server it should pick it up.
Note that it works for
embeddings = OllamaEmbeddings(base_url="http://<ip>:11434", model="llama2")
OllamaEmbeddings is from langchain -- I agree, it should be configurable. The contributor convinced me otherwise ha
The suggestion above should be an OK workaround for now (I think)
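i.e. something like patching the attribute after construction (my untested guess at the workaround):
llm = Ollama(model="llama2")
llm.base_url = "http://<ip>:11434"  # assumed workaround: override the hard-coded default post-init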
feel free to submit a PR though
Ok, thanks. Trying that workaround now. Appreciate your help on this; based on other discussions, I think others will run into this too.
My fallback was to try changing the base URL of the OpenAI client and putting up my own API that mimics OpenAI's.
Workaround failed. It's still calling localhost.
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffb9fe745b0>: Failed to establish a new connection: [Errno 111] Connection refused'))
one sec, let me just fix this
Just submitted the PR. Please review, as I didn't test it.
cool, added it to the super init and merged, thx
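(For anyone curious, a toy illustration of what "added to the super init" means -- invented classes, not the actual llama_index source:)
# hypothetical sketch of the bug and the fix
class _Base:
    def __init__(self, **kwargs):
        # the parent stores the endpoint, falling back to a hard-coded default
        self.base_url = kwargs.get("base_url", "http://localhost:11434")

class BrokenOllama(_Base):
    def __init__(self, model: str, base_url: str = "http://localhost:11434"):
        super().__init__()  # bug: base_url never forwarded, so the default always wins
        self.model = model

class FixedOllama(_Base):
    def __init__(self, model: str, base_url: str = "http://localhost:11434"):
        super().__init__(base_url=base_url)  # fix: forward it to the super init
        self.model = model

assert BrokenOllama("llama2", base_url="http://<ip>:11434").base_url == "http://localhost:11434"
assert FixedOllama("llama2", base_url="http://<ip>:11434").base_url == "http://<ip>:11434"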