
@Logan M I was trying out llama2 model with llama-index and I know there's the LlamaCpp class in this library. However say I am hosting a llama-cpp model on my own server, is there a way I can get llama-index to use the model from the server? With the current implementation, we need to store the llama model file locally, but I want to store and use that llama model file from somewhere else entirely by hosting it on a server and making it accessible to other applications similar to the OpenAI api. Is it possible to do something like this?
yea, there could definitely be a llama-cpp server integration. Otherwise, you'd have to wrap your server API in a custom LLM class
Ok, so I will need to implement the complete method, which will communicate with the llama-cpp server and return the text from the server response, right?
you got it πŸ‘
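A minimal sketch of what that wrapper could look like, following the CustomLLM pattern from the llama-index docs; the server URL, JSON fields, and limits are placeholders, and it assumes a llama.cpp server exposing its HTTP /completion endpoint:
Plain Text
import requests

from llama_index.llms import (
    CustomLLM,
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.llms.base import llm_completion_callback


class LlamaCppServerLLM(CustomLLM):
    """Calls a self-hosted llama.cpp server instead of loading weights locally."""

    server_url: str = "http://my-llama-server:8080/completion"  # placeholder
    context_window: int = 4096
    num_output: int = 256

    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name="llama-2-on-my-server",  # placeholder
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs) -> CompletionResponse:
        # forward the prompt to the remote server and return its generated text
        resp = requests.post(
            self.server_url,
            json={"prompt": prompt, "n_predict": self.num_output},
        )
        resp.raise_for_status()
        return CompletionResponse(text=resp.json().get("content", ""))

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs) -> CompletionResponseGen:
        # simplest possible implementation: yield the full completion at once
        yield self.complete(prompt, **kwargs)
The instance can then be passed to ServiceContext.from_defaults(llm=LlamaCppServerLLM()) like any other LLM.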
Hey @Logan M, would it work if I changed openai.api_base to point to the URL of my self-hosted API and used llama-index as is?
@Borg1903 definitely! As long as the API interface is the same
Cool got it!
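For reference, a rough sketch of that approach, assuming the self-hosted server mimics the OpenAI REST API (the URL and key are placeholders):
Plain Text
import openai

# placeholder URL for a self-hosted, OpenAI-compatible server
openai.api_base = "http://my-llm-server:8000/v1"
openai.api_key = "sk-fake-key"  # must be non-empty, but the local server ignores it

# llama-index's default OpenAI LLM will now send its requests to that server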
I'm looking to do the same thing (use a privately hosted llama model exposed via a network endpoint) with either Ollama or the llama.cpp server.

I'm running into two issues:
  1. If I supply a (fake) OPENAI_API_KEY, I get an authentication error. That shouldn't happen, because my LLM is defined by llm = Ollama(model="llama2").
  2. If I omit the OPENAI_API_KEY, it tries to run the model locally.
Any suggestions?
Can you share more of your setup?
Sure. I think the key parts are:
llm = Ollama(model="llama2") #need to make sure llama is listening on localhost:11434
embeddings = OllamaEmbeddings(base_url="[URL here]:11434", model="llama2")
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embeddings)
agent = ReActAgent.from_tools(query_engine_tools, llm=function_llm, verbose=True)

Basically following the Recursive Retriever + Document Agents sample but with Ollama.
I think you also need to pass the service context to the underlying query engines 🤔 (also not sure what function_llm is here)
Following https://gpt-index.readthedocs.io/en/stable/examples/query_engine/recursive_retriever_agents.html

the function_llm is what invokes the "agent", and I'm leveraging (or attempting to leverage) llama for that too.

It's defined as:
function_llm = Ollama(model="llama2")

Could be that I didn't pass the service context to the query engine. I can try that.

# define query engines

vector_query_engine = vector_index.as_query_engine()
list_query_engine = summary_index.as_query_engine()
Yea, could be VectorStoreIndex.from_documents(..., service_context=service_context)
Will give it a shot now. Thanks.
So it is still complaining about no OPENAI_API_KEY, but it does fall back to using HuggingFaceBgeEmbeddings with model_name=BAAI/bge-small-en.

I would expect it to make a call to Ollama for embeddings.
Then it fails with a CUDA error (I've not given it access to GPUs, since it should be leveraging the LLM via the network).
Here's what I have:
Plain Text
# build vector index
vector_index = VectorStoreIndex.from_documents(
    product_docs[product_model], service_context=service_context
)
# build summary index
summary_index = SummaryIndex.from_documents(
    product_docs[product_model], service_context=service_context
)
# define query engines
vector_query_engine = vector_index.as_query_engine(service_context=service_context)
list_query_engine = summary_index.as_query_engine(service_context=service_context)
and that service context has embed_model=... in it?
(also fyi, those BGE embeddings are going to be much better than ollama 😅 but that's beside the point for now)
Yes
Plain Text
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embeddings)
Ok, just to make our lives easier (since something must be getting missed somewhere) you can also set a global service context
I agree that the BGE embeddings would likely be better. Just trying to prove the model at the moment.
Plain Text
from llama_index import set_global_service_context

set_global_service_context(service_context)
I set the global service context but no change in error.
Just to be explicit - here's the error:
Plain Text
******
Could not load OpenAIEmbedding. Using HuggingFaceBgeEmbeddings with model_name=BAAI/bge-small-en. If you intended to use OpenAI, please check your OPENAI_API_KEY.
Original error:
No API key found for OpenAI.
Please set either the OPENAI_API_KEY environment variable or openai.api_key prior to initialization.
API keys can be found or created at https://platform.openai.com/account/api-keys

******
******
Could not load OpenAI model. Using default LlamaCPP=llama2-13b-chat. If you intended to use OpenAI, please check your OPENAI_API_KEY.
Original error:
No API key found for OpenAI.
Please set either the OPENAI_API_KEY environment variable or openai.api_key prior to initialization.
API keys can be found or created at https://platform.openai.com/account/api-keys

******

CUDA error 35 at /tmp/pip-install-yfh1hz6h/llama-cpp-python_37ac4eecf55744769bddf959a0139175/vendor/llama.cpp/ggml-cuda.cu:5509: CUDA driver version is insufficient for CUDA runtime version
current device: 0
any way you can put the code into a single reproducible script?
So far it's just been snippets; it would be helpful to be able to reproduce it locally
Just provided the code privately.
Of course, it assumes you have an Ollama server running for an endpoint.
for sure -- I will swap it out for something else, just to test. Thanks!
I think the issue is related to this block
Plain Text
query_engine = RetrieverQueryEngine.from_args(
    recursive_retriever,
    response_synthesizer=response_synthesizer,
    service_context=service_context, 
    base_url="<ip>:11434",
)

There's no way to assign a new base_url to the RetrieverQueryEngine, so it makes a call to
Plain Text
localhost:11434
and that fails. There's no way to define a base_url in
Plain Text
llm = Ollama(model="llama2")
either.
Note that even though I passed "base_url", it's not utilized.
Base url is already set on the Ollama LLM though
(sorry, haven't had a chance to test your code yet haha)
hmmm, you can't change that though; it's hard coded to localhost:11434
calling:
Plain Text
llm = Ollama(base_url="<ip>", model="llama2")  # need to make sure Ollama is listening at <ip>:11434

throws an error
The contributor that added Ollama pointed out that the URL is hardcoded in Ollama as well
The port is, but I can run Ollama on any server anywhere; it listens on 0.0.0.0, so as long as I point my client at that server it should pick it up.
llm.base_url = "...."
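i.e. roughly this (the IP is a placeholder):
Plain Text
llm = Ollama(model="llama2")
llm.base_url = "http://<ip>:11434"  # override the hard-coded default after construction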
Note that it works for
Plain Text
embeddings = OllamaEmbeddings(base_url="http://<ip>:11434", model="llama2") 
OllamaEmbeddings is from langchain -- I agree, it should be configurable. The contributor convinced me otherwise ha
The suggestion above should be an ok workaround for now (i think)
feel free to submit a PR though
Ok, thanks. Trying that workaround now. Appreciate your help on this and based on other discussions I think others will run into this too.
My fallback was to try to change the base URL of the OpenAI client and put up my own API that mimics OpenAI's
Workaround failed. It's still calling localhost.
Plain Text
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=11434): Max retries exceeded with url: /api/generate/ (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7ffb9fe745b0>: Failed to establish a new connection: [Errno 111] Connection refused'))
hmmm, well that's lame
one sec, let me just fix this
ah you beat me to it
Just submitted the PR. Please review as I didn't test.
cool, added to the super init and merged πŸ‘ thx
nice. Thanks!
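With that merged, a constructor-level override along these lines should work (the IP is a placeholder, untested here):
Plain Text
llm = Ollama(model="llama2", base_url="http://<ip>:11434")
embeddings = OllamaEmbeddings(base_url="http://<ip>:11434", model="llama2")
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embeddings)
set_global_service_context(service_context)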