Custom LLM Streaming

At a glance

The original poster cannot find documentation on how to implement streaming for a custom LLM; the CompletionResponseGen part is left unimplemented in the docs. Commenters suggest using a HuggingFaceLLM or LangchainLLM, which LlamaIndex already supports, and note that anything else will require some "hacks". The thread discusses how the CompletionResponseGen object is used, with one commenter pointing to how streaming is implemented for HuggingFace and Anthropic. The participants troubleshoot the issue together and arrive at a solution: return a generator function from stream_complete, accumulate the output with content +=, and set the delta on each CompletionResponse object.
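A minimal sketch of the pattern the thread converges on, modeled on the old custom-LLM docs example; it assumes the legacy llama_index.llms import path of that era, and MyLLM plus fake_token_stream are hypothetical stand-ins rather than anything from the thread:

Python
from typing import Any

from llama_index.llms import (
    CompletionResponse,
    CompletionResponseGen,
    CustomLLM,
    LLMMetadata,
)
from llama_index.llms.base import llm_completion_callback


def fake_token_stream(prompt: str):
    # hypothetical stand-in for a real streaming API call
    yield from ["1 + 1", " equals ", "2"]


class MyLLM(CustomLLM):
    @property
    def metadata(self) -> LLMMetadata:
        return LLMMetadata(model_name="my-model")

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        return CompletionResponse(text="".join(fake_token_stream(prompt)))

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        # return a generator function; accumulate the text and set the delta
        def gen() -> CompletionResponseGen:
            content = ""
            for token in fake_token_stream(prompt):
                content += token
                yield CompletionResponse(text=content, delta=token)

        return gen()
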

I can't find a doc on how to implement streaming for a custom LLM; the CompletionResponseGen part is not implemented in the docs
33 comments
Are you using a HuggingFaceLLM or LangchainLLM? Those are already supported by LlamaIndex; if not, you will need to do some hacks
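For reference, a rough sketch of that built-in route, assuming the legacy llama_index.llms import path and the StableLM model used in the old docs examples (requires transformers and torch installed):

Python
from llama_index.llms import HuggingFaceLLM

llm = HuggingFaceLLM(
    model_name="StabilityAI/stablelm-tuned-alpha-3b",  # example model from the old docs
    tokenizer_name="StabilityAI/stablelm-tuned-alpha-3b",
)
for response in llm.stream_complete("1 + 1 = "):
    print(response.delta, end="")
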
@Emanuel Ferreira I just use API requests with stream=True
Any idea how to use the CompletionResponseGen object?
maybe @Logan M can help better on that
@Logan M I tried to use yield CompletionResponseGen(text=generated_text) but I get the error Generator() takes no arguments
@Bar Haim yea, if you look at those examples above, you need to return a function that will yield ChatResponse(..) objects
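Note that CompletionResponseGen and ChatResponseGen are generator type aliases rather than classes, which is where the Generator() takes no arguments error comes from. A minimal sketch of the chat-side shape, assuming the legacy import path (the token list stands in for a real stream; stream_complete works the same way with CompletionResponse objects):

Python
from typing import Sequence

from llama_index.llms import ChatMessage, ChatResponse, ChatResponseGen


def stream_chat_sketch(messages: Sequence[ChatMessage]) -> ChatResponseGen:
    # return an inner generator that yields ChatResponse objects, each carrying
    # the accumulated message plus the newest token as the delta
    def gen() -> ChatResponseGen:
        content = ""
        for token in ["1 + 1", " = ", "2"]:  # stand-in for a real token stream
            content += token
            yield ChatResponse(
                message=ChatMessage(role="assistant", content=content),
                delta=token,
            )

    return gen()
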
Thanks, it works
however, when I use it as the streaming model for an index query and iterate with for token in response.response_gen, the tokens are empty strings
but when using it directly with stream_complete:
Plain Text
    llm = BamLLM()
    resp = llm.stream_complete('1 + 1')
    for delta in resp:
        print(delta, end='')

it works
I want to use it with index.query
this code:
Plain Text
streaming_response = self.query_engine.query(prompt)
for token in streaming_response.response_gen:
    print(token)

gives me empty strings
it prints, but only empty strings
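For context, a rough sketch of how a streaming query engine is typically wired up with a custom LLM in that era of llama-index (the ServiceContext-style API is assumed; BamLLM is the custom class from this thread and the document is a placeholder):

Python
from llama_index import Document, ServiceContext, VectorStoreIndex

# BamLLM is the custom LLM class discussed in this thread
service_context = ServiceContext.from_defaults(llm=BamLLM())
index = VectorStoreIndex.from_documents(
    [Document(text="1 + 1 = 2")],  # placeholder document
    service_context=service_context,
)
query_engine = index.as_query_engine(streaming=True)

streaming_response = query_engine.query("What is 1 + 1?")
for token in streaming_response.response_gen:
    print(token, end="")
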
hmm, are you able to share the code for the stream_complete function? If not, I can try and make an example
Plain Text
    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        data = {
            "model_id": os.getenv('MODEL_NAME'),
            "inputs": [prompt],
            "parameters": {
                "temperature": float(os.getenv('TEMPERATURE')),
                "max_new_tokens": int(os.getenv('MAX_OUTPUT_TOKENS')),
                "stream": True
            }
        }
        headers = {
            "Authorization": f"Bearer {os.getenv('GENAI_KEY')}",
        }
        response = requests.post(os.getenv('GENAI_API'), json=data, headers=headers, stream=True)
        if response.status_code == 200:
            for chunk in response.iter_content(chunk_size=4096):
                try:
                    if chunk:
                        output_str = chunk.decode('utf-8')
                        if output_str.startswith('data: '):
                            output_str = output_str[len('data: '):]
                        data = json.loads(output_str)
                        generated_text = data['results'][0]['generated_text']
                        yield CompletionResponse(text=generated_text)
                except Exception as ex:
                    print(str(ex))
@Logan M this is my stream_complete
if I use it with
Plain Text
llm = BamLLM()
resp = llm.stream_complete('1 + 1')
for delta in resp:
    print(delta, end='')

it works, but not when using it with the index query engine
cool! Let me double check the code
when using
Plain Text
streaming_response = self.query_engine.query(prompt)
for token in streaming_response.response_gen:
    print(token)

it's all empty
[Attachment: image.png]
if I add a print inside the method itself, before the yields, it prints fine
the output shows just those prints
[Attachment: image.png]
Plain Text
    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        data = {
            "model_id": os.getenv('MODEL_NAME'),
            "inputs": [prompt],
            "parameters": {
                "temperature": float(os.getenv('TEMPERATURE')),
                "max_new_tokens": int(os.getenv('MAX_OUTPUT_TOKENS')),
                "stream": True
            }
        }
        headers = {
            "Authorization": f"Bearer {os.getenv('GENAI_KEY')}",
        }
        response = requests.post(os.getenv('GENAI_API'), json=data, headers=headers, stream=True)

        def gen():
            content = ""
            if response.status_code == 200:
                for chunk in response.iter_content(chunk_size=4096):
                    try:
                        if chunk:
                            output_str = chunk.decode('utf-8')
                            if output_str.startswith('data: '):
                                output_str = output_str[len('data: '):]
                            data = json.loads(output_str)
                            generated_text = data['results'][0]['generated_text']
                            # accumulate the full text and set the new chunk as the delta
                            content += generated_text
                            yield CompletionResponse(text=content, delta=generated_text)
                    except Exception as ex:
                        print(str(ex))
            else:
                yield CompletionResponse(text="Network Error")

        return gen()
I think you were missing the delta
also, it should probably return a gen() function, just to be consistent with the other LLMs
it works, I'm trying to understand what changed
other than the gen()
got it, thanks 👍
it was missing in the docs
yea, content += as well as setting delta is the main change
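Roughly speaking (a simplified sketch, not the exact llama-index internals), the token stream the query engine hands back is built from each response's delta, so an unset delta surfaces as empty strings:

Python
from typing import Generator

from llama_index.llms import CompletionResponse


def to_token_gen(
    completion_gen: Generator[CompletionResponse, None, None]
) -> Generator[str, None, None]:
    # each CompletionResponse is reduced to its delta; if delta was never set,
    # this degrades to a stream of empty strings
    for response in completion_gen:
        yield response.delta or ""
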
Hopefully it works lol just working from the existing source code in llama-index 😆