Ah okay, so you'll need to put the model behind a server for it to stay available; otherwise the LLM gets reloaded for every query, which takes a lot of time.
There are two ways:
1: Your current way: you can use the CustomLLM abstraction (https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom.html#example-using-a-custom-llm-model-advanced) and, in the `complete` method, run the model as you are doing right now and return the response in the given format.
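For reference, here is a minimal sketch of option 1, adapted from the CustomLLM example in those docs. It uses a Hugging Face `transformers` pipeline with `gpt2` purely as a stand-in for whatever local model you are actually running; swap in your own loading and generation calls.

```python
from typing import Any

from llama_index.core.llms import (
    CustomLLM,
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.core.llms.callbacks import llm_completion_callback
from transformers import pipeline

# Load the model ONCE at import time so it is not reloaded on every query.
# "gpt2" is only a stand-in for your local model.
generator = pipeline("text-generation", model="gpt2")


class MyLocalLLM(CustomLLM):
    context_window: int = 2048
    num_output: int = 256
    model_name: str = "my-local-model"

    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # Run your model here and wrap its output in CompletionResponse.
        text = generator(prompt, max_new_tokens=self.num_output)[0]["generated_text"]
        return CompletionResponse(text=text)

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        # Simple non-streaming fallback: yield the full completion once.
        yield self.complete(prompt, **kwargs)
```

After that, `Settings.llm = MyLocalLLM()` plugs it into LlamaIndex, and the model is only loaded once when the module is imported.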
2: Deploy your model on a local server using FastAPI: launch the server so the model stays loaded behind a v1/generate endpoint, and use OpenAILike (https://docs.llamaindex.ai/en/stable/examples/llm/localai.html#llamaindex-interaction); a rough FastAPI sketch is at the end of this answer.
- Install the required PyPI package:
```
pip install llama-index-llms-openai-like
```
- These LocalAI defaults are:
```python
LOCALAI_DEFAULTS = {
    "api_key": "localai_fake",
    "api_type": "localai_fake",
    "api_base": "http://localhost:8000/v1/generate",
}
```
- Then point OpenAILike at that server and set it as the default LLM:
```python
from llama_index.core import Settings
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai_like import OpenAILike

MAC_M1_LUNADEMO_CONSERVATIVE_TIMEOUT = 10 * 60  # sec

llm = OpenAILike(
    **LOCALAI_DEFAULTS,
    model="lunademo",
    is_chat_model=True,
    timeout=MAC_M1_LUNADEMO_CONSERVATIVE_TIMEOUT,
)
Settings.llm = llm
```
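Once `Settings.llm` is set, a quick sanity check confirms the server is reachable:

```python
response = llm.complete("Hello, are you there?")
print(response.text)
```

And here is a rough sketch of the FastAPI server side. OpenAILike speaks the OpenAI wire format, so this assumes an OpenAI-style chat-completions route under the same base URL configured above; the route path, request/response fields, and the `gpt2` pipeline are illustrative assumptions, not something prescribed by LlamaIndex.

```python
# server.py -- rough sketch of an OpenAI-compatible endpoint for a local model.
import time

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Loaded once at startup, so every request reuses the same in-memory model.
# "gpt2" is only a stand-in for your actual local model.
generator = pipeline("text-generation", model="gpt2")


class Message(BaseModel):
    role: str
    content: str


class ChatRequest(BaseModel):
    model: str
    messages: list[Message]
    max_tokens: int = 256
    temperature: float = 0.7


@app.post("/v1/generate/chat/completions")
def chat_completions(req: ChatRequest) -> dict:
    # Flatten the chat history into a single prompt for the local model.
    prompt = "\n".join(f"{m.role}: {m.content}" for m in req.messages)
    text = generator(prompt, max_new_tokens=req.max_tokens)[0]["generated_text"]
    # Minimal OpenAI-style response body so the OpenAILike client can parse it.
    return {
        "id": "chatcmpl-local",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": req.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
        "usage": {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0},
    }
```

Run it with `uvicorn server:app --port 8000` so it matches the `api_base` above.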