Hi everyone, I'm having some issues

Hi everyone, I'm having some issues implementing RAG using LlamaIndex with my Llama 2 cpp custom LLM model (Docker deployed using Flask on GCP).

The main problem is that LlamaIndex makes multiple requests to my API, which greatly increases the response time compared to a single request.

Can anyone help with some guidance?

I'm following this example in documentation (Example: Using a Custom LLM Model - Advanced):
https://docs.llamaindex.ai/en/stable/module_guides/models/llms/usage_custom/#example-using-a-custom-llm-model---advanced
Plain Text
# Imports follow the LlamaIndex docs example linked above (llama_index >= 0.10).
from typing import Any

import requests
from llama_index.core.llms import (
    CustomLLM,
    CompletionResponse,
    CompletionResponseGen,
    LLMMetadata,
)
from llama_index.core.llms.callbacks import llm_completion_callback


class OurLLM(CustomLLM):
    context_window: int = 1200
    num_output: int = 256
    model_name: str = "custom"
    dummy_response: str = "My response"

    @property
    def metadata(self) -> LLMMetadata:
        """Get LLM metadata."""
        return LLMMetadata(
            context_window=self.context_window,
            num_output=self.num_output,
            model_name=self.model_name,
        )
    
    def llama2_13b_gcp(self, role_user):
        """

        """
        url = "http://00.000.000.000:8080/api/chat"

        headers = {
            "accept": "application/json",
            "Content-Type": "application/json"
        }
        data = {
            "messages": [
                {"role": "system", "content": "Answer the question"},
                {"role": "user", "content": role_user}
            ]
        }

        response = requests.post(url, headers=headers, json=data)

        resp = response.json()['choices'][0]['message']['content']

        return resp
    
    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs: Any) -> CompletionResponse:
        # Forward the full prompt (query plus retrieved context) to the remote model.
        return CompletionResponse(text=self.llama2_13b_gcp(prompt))

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs: Any) -> CompletionResponseGen:
        # Not true token-by-token streaming: the full response is yielded as one chunk.
        yield CompletionResponse(text=self.llama2_13b_gcp(prompt))
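For context, a custom LLM like the one above is typically plugged into the rest of the RAG pipeline roughly as follows. This is a minimal sketch based on the same docs page linked above, assuming a recent llama_index version with the Settings API; the "./data" path and the query string are placeholders.
Plain Text
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex

# Register the custom LLM globally; the embedding model is left at its default here.
Settings.llm = OurLLM()

# Build a simple vector index over local documents and query it.
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
print(query_engine.query("What is this document about?"))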
4 comments
Your context window is only 1200 -- this is very small, and it will result in the response synthesizer needing to make many LLM calls in order to give the LLM a chance to read all the returned text.
I would suggest greatly lowering the chunk size (256?) and possibly lowering the top k (see the settings sketch below).
Or just allow a larger context window (Llama 2 has a window of 4096, no?).
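To make those suggestions concrete, here is a minimal sketch of where the knobs live, assuming the llama_index.core Settings API and the `index` built in the sketch above; the values are illustrative, not recommendations.
Plain Text
from llama_index.core import Settings

# Smaller chunks and a lower top-k mean less retrieved text per LLM call,
# so the response synthesizer needs fewer refine round-trips.
Settings.chunk_size = 256

# Retrieve fewer chunks per query.
query_engine = index.as_query_engine(similarity_top_k=2)

# Alternatively, raise context_window on OurLLM (e.g. 4096) if the backend can handle it.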
Hi Logan, thanks for the response.
Yes, Llama 2 has a 4096 context window, but I received an error in my API when I used that number, which didn't happen when I reduced it to 1200.
This is the error:
Plain Text
ggml_allocr_alloc: not enough space in the buffer (needed 289444000, largest block available 27545600)


But now I've reduced the top k and receive a response in less than 20 seconds, which was enough for me. Thanks.

If you have any clue about this buffer error, please tell me.
I already increased the RAM in my VM to 500 GB and added 4 T4 GPUs, but the error stays the same when I use 4096 as the context window.
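One possible cause (an assumption, since the thread doesn't show how the backend loads the model): the ggml buffer that fails to allocate is sized from the context and batch settings the model was loaded with, not from total system RAM, so adding RAM or GPUs would not change it. If the Flask API wraps llama-cpp-python, the relevant load parameters look roughly like this; the path and values are placeholders.
Plain Text
from llama_cpp import Llama

# Hypothetical server-side model load; adjust path and values for the real deployment.
llm = Llama(
    model_path="/models/llama-2-13b-chat.Q4_K_M.gguf",
    n_ctx=4096,       # must cover the context_window LlamaIndex sends
    n_batch=256,      # smaller batches shrink the scratch buffers ggml allocates
    n_gpu_layers=-1,  # offload all layers to the GPUs if VRAM allows
)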

Thanks!