Find answers from the community

Updated 3 months ago

how do I use a local LLM

3 comments
Please stand by @Steve_h_112 as I search the knowledge base. We are experiencing especially high traffic right now, so this might take longer than usual. :wave:
----------------------
To use a local LLM with LlamaIndex, you need to implement the LLM class, or CustomLLM for a simpler interface. You are responsible for passing the prompt text to the model and returning the newly generated tokens. Here is an example of how to do this using a locally running facebook/OPT model and Hugging Face's pipeline abstraction:

```python
import torch
from transformers import pipeline
from typing import Optional, List, Mapping, Any

from llama_index import (
    ServiceContext,
    SimpleDirectoryReader,
    LangchainEmbedding,
    ListIndex,
)
from llama_index.llms import (
    CustomLLM,
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)

# set context window size
context_window = 2048

# set number of output tokens
num_output = 256

# store the pipeline/model outside of the LLM class to avoid memory issues
model_name = "facebook/opt-iml-max-30b"
pipeline = pipeline(
    "text-generation",
    model=model_name,
    device="cuda:0",
    model_kwargs={"torch_dtype": torch.bfloat16},
)

class OurLLM(CustomLLM):

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=context_window,
            num_output=num_output,
        )

    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        prompt_length = len(prompt)
        response = pipeline(prompt, max_new_tokens=num_output)[0]["generated_text"]

        # only return newly generated tokens
        text = response[prompt_length:]
        return CompletionResponse(text=text)

    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        raise NotImplementedError()

# define our LLM

llm = OurLLM()

service_context = ServiceContext.from_defaults(
    llm=llm,
    context_window=context_window,
    num_output=num_output,
)

# load your data

documents = SimpleDirectoryReader('./data').load_data()
index = ListIndex.from_documents(documents, service_context=service_context)

# query and print the response

query_engine = index.as_query_engine()
response = query_engine.query("<query_text>")
print(response)
```

This example uses the CustomLLM class to define a new LLM that wraps a locally running model. The complete method is responsible for passing the text to the model and returning the newly generated tokens; the stream_complete method is not implemented in this example. Please note that you may have to adjust the internal prompts to get good performance. Even then, you should use a sufficiently large LLM to ensure it is capable of handling the complex queries that LlamaIndex uses internally, so your mileage may vary. For more details, you can refer to the **LlamaIndex documentation** (<https://gpt-index.readthedocs.io/en/latest/how_to/customization/custom_llms.html>).
----------------------
Note: you can ask me a follow-up question by @ mentioning me again :speech_balloon:
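Since stream_complete is left unimplemented above, here is a hedged sketch of what a streaming version might look like. It is an illustration, not the documented LlamaIndex implementation: the subclass name StreamingOurLLM is made up, it reuses the `pipeline` and `num_output` defined in the example above, and it relies on Hugging Face's TextIteratorStreamer. Exact CustomLLM method signatures vary by LlamaIndex version, so treat this as a starting point.

```python
from threading import Thread
from typing import Any

from transformers import TextIteratorStreamer

from llama_index.llms import CompletionResponse, CompletionResponseGen


class StreamingOurLLM(OurLLM):
    """Hypothetical extension of OurLLM that streams tokens as they are generated."""

    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        # reuse the tokenizer and model held by the pipeline defined above
        tokenizer = pipeline.tokenizer
        model = pipeline.model

        # stream decoded tokens out of model.generate as they are produced
        streamer = TextIteratorStreamer(
            tokenizer, skip_prompt=True, skip_special_tokens=True
        )
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

        # run generation in a background thread so we can consume the streamer here
        thread = Thread(
            target=model.generate,
            kwargs=dict(**inputs, streamer=streamer, max_new_tokens=num_output),
        )
        thread.start()

        # yield the cumulative text plus the newly generated delta for each chunk
        text = ""
        for token in streamer:
            text += token
            yield CompletionResponse(text=text, delta=token)
```

If this fits your version, you could pass llm = StreamingOurLLM() to ServiceContext.from_defaults in place of OurLLM above.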
----------------------
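On the note about adjusting the internal prompts: as a hedged sketch based on the same era of the LlamaIndex docs linked above, you can override the question-answering template that the query engine uses. The Prompt import and the text_qa_template argument may be named or located differently in your LlamaIndex version.

```python
from llama_index import Prompt

# a simpler instruction-style QA template that smaller local models tend to follow better
template = (
    "We have provided context information below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given this information, please answer the question: {query_str}\n"
)
qa_template = Prompt(template)

# pass the custom template when building the query engine from the index above
query_engine = index.as_query_engine(text_qa_template=qa_template)
response = query_engine.query("<query_text>")
print(response)
```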