Local API

At a glance

The community member is working on a project using the locally installed Llama 2 model and a simple API interface. They want to integrate this with the existing LlamaIndex library without changing too much of their code. The comments suggest that the best approach is to implement the LLM class, providing an example from the LlamaIndex documentation. The community members further discuss how to set up the API access points and how LlamaIndex decides when to stop calling the LLM API and return the final answer to the user. However, there is no explicitly marked answer in the provided information.

Hi all, I am doing a project with a locally installed Llama 2 model and the following simple API interface:

{
    "input": "how is weather in new york",
    "context": "new york is hot in these days"
}

The input is the query, and the context should come from the vector DB. How can I integrate this with the existing LlamaIndex library without changing too much of my code? @WhiteFang_Jr
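
For reference, a call to an interface shaped like this might look roughly as follows (a minimal sketch; the endpoint URL and the "output" response field are assumptions, not details given in the thread):

import requests

# Hypothetical address of the locally hosted Llama 2 API described above.
API_URL = "http://localhost:8000/generate"

payload = {
    "input": "how is weather in new york",
    "context": "new york is hot in these days",
}

resp = requests.post(API_URL, json=payload, timeout=60)
resp.raise_for_status()
# The response field name is an assumption; adjust it to whatever your API returns.
print(resp.json()["output"])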
4 comments
Your best bet is implementing the LLM class

There's a small example here https://gpt-index.readthedocs.io/en/latest/core_modules/model_modules/llms/usage_custom.html#example-using-a-custom-llm-model-advanced

Basically in the complete/stream_complete endpoints, you'll want to send requests to your api

There's also chat/stream_chat endpoints, if you want to handle how lists of chat messages get sent to the LLM as well
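
As a rough illustration of that advice (not part of the original replies): a CustomLLM subclass whose complete() forwards the prompt to the local Llama 2 API. The endpoint URL, the "output" response field, the metadata values, and the class name are assumptions. Note that LlamaIndex folds the retrieved context into the prompt before calling complete(), so the simplest mapping is to send the whole prompt as "input".

from typing import Any

import requests

from llama_index.llms import (  # newer releases: llama_index.core.llms
    CompletionResponse,
    CompletionResponseGen,
    CustomLLM,
    LLMMetadata,
)

API_URL = "http://localhost:8000/generate"  # assumed address of the local Llama 2 API


class LocalLlamaAPI(CustomLLM):
    # Illustrative values; set them to match your local Llama 2 deployment.
    context_window: int = 4096
    num_output: int = 256
    model_name: str = "llama-2-local"

    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # The retrieved context is already part of `prompt`, so send it as "input".
        payload = {"input": prompt, "context": ""}
        resp = requests.post(API_URL, json=payload, timeout=120)
        resp.raise_for_status()
        # "output" is an assumed response field; adjust it to your API.
        return CompletionResponse(text=resp.json()["output"])

    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        raise NotImplementedError("streaming is not wired up in this sketch")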
Thanks for the quick response. But how do I set up the API access point?
@Logan M Further question: how do I set up the API access point, etc., in the section below?

class OurLLM(CustomLLM):

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=context_window,
            num_output=num_output,
            model_name=model_name,
        )

    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        prompt_length = len(prompt)
        response = pipeline(prompt, max_new_tokens=num_output)[0]["generated_text"]
        # only return newly generated tokens
        text = response[prompt_length:]
        return CompletionResponse(text=text)

    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        raise NotImplementedError()
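
For completeness, a custom LLM like the one sketched above (or OurLLM from the docs) is usually plugged into the rest of LlamaIndex through the service context, so the existing index and query code barely changes. A rough sketch, assuming a LlamaIndex version from around the time of this thread where ServiceContext is the entry point (newer releases use Settings instead):

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

# LocalLlamaAPI is the hypothetical CustomLLM sketch shown earlier in this thread.
service_context = ServiceContext.from_defaults(
    llm=LocalLlamaAPI(),
    embed_model="local",  # keep embeddings local as well; the default would call OpenAI
)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
print(query_engine.query("how is weather in new york"))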
And in the normal API interaction with the LLM, I believe LlamaIndex will query the LLM a couple of times to refine the answer. How does LlamaIndex decide when to stop calling the LLM API and return the final answer to the user?