
@Logan M hey, so I'm trying to use a custom LLM inside create_llama_chat_agent: llm = customLLM(); ac = create_llama_chat_agent( toolkit=toolkit, llm=llm, memory=memory, verbose=True ). I'm unable to use the custom LLM inside this.
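For context, create_llama_chat_agent hands the llm through to LangChain's agent machinery, so the custom model generally has to be a LangChain LLM subclass rather than a LlamaIndex-only wrapper. A minimal sketch, assuming the LangChain API of that era; CustomLLM is a placeholder name and the stubbed return stands in for the real model call:

Python
# Minimal custom LLM wrapper for LangChain (sketch)
from typing import List, Optional
from langchain.llms.base import LLM

class CustomLLM(LLM):
    @property
    def _llm_type(self) -> str:
        return "custom"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # replace this stub with the real model call
        return "stubbed answer for: " + prompt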
It's a question-answering chain, but when I try to use it, it just gives me an error
The llama index retriever is not compatible with langchain as far as I know 🤔
@Logan M I FIGURED OUT THE STREAMING OVER API!!! with custom LLM
figured out that queue stuff!!!
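The "queue stuff" isn't spelled out in the thread; one common pattern at the time for streaming a LangChain chain over an API looked roughly like the sketch below (an assumption about the approach, not the exact code from this conversation): a callback pushes tokens onto a queue while the chain runs in a background thread, and the HTTP response drains the queue. It assumes the underlying LLM was created with streaming enabled.

Python
# Queue-based token streaming for a LangChain chain (sketch)
import threading
from queue import Queue
from langchain.callbacks.base import BaseCallbackHandler

class QueueCallback(BaseCallbackHandler):
    """Push each generated token onto a queue."""
    def __init__(self, q: Queue):
        self.q = q

    def on_llm_new_token(self, token: str, **kwargs) -> None:
        self.q.put(token)

    def on_llm_end(self, *args, **kwargs) -> None:
        self.q.put(None)  # sentinel: generation finished

def stream_tokens(chain, question: str):
    """Run the chain in a background thread and yield tokens as they arrive."""
    q = Queue()
    threading.Thread(
        target=chain.run,
        args=(question,),
        kwargs={"callbacks": [QueueCallback(q)]},
    ).start()
    while True:
        token = q.get()
        if token is None:
            break
        yield token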
Yooooo big news!!
with success comes new problems
im bout to try it in deploy mode
but i wonder how memory plays lol
Oo the chat engine is out
Oh also @Logan M, you know that iterator with the pipeline is actually an issue, that's why it was giving us that thread lock error
It wasn't us doing anything wrong, it's an active issue. It's exactly due to the shallow copy thing I was talking about
Ohhhh that explains a lot!
The guy is on vacation or smth so when he comes back he will look at it
Could also use the raw model/tokenizer to get around it too lol
It's on the transformers pipeline issues on GitHub
That would work
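For reference, the raw model/tokenizer workaround usually means skipping the pipeline entirely and streaming from generate() with transformers' TextIteratorStreamer. A minimal sketch, with gpt2 standing in for whatever model was actually being served:

Python
# Stream from the raw model/tokenizer instead of the pipeline (sketch)
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def stream_generate(prompt: str, max_new_tokens: int = 256):
    inputs = tokenizer(prompt, return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = dict(**inputs, streamer=streamer, max_new_tokens=max_new_tokens)
    Thread(target=model.generate, kwargs=generation_kwargs).start()
    for text_chunk in streamer:  # decoded text pieces as they are generated
        yield text_chunk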
But this required the whole custom LLM lol
Wish langchain could have just implemented the predictor as an option, would have made life so much easier, which I guess is what the chat engine is accomplishing, so that's good. Looking forward to the streaming part of it and then using it
But then also need to find out how to deal with memory for each person
Yea the memory will be annoying. I know some people create an ID for each user and store the memory object in redis somehow
from langchain.memory import ConversationBufferMemory
from langchain import LLMChain

# create a dictionary to store the memory for each user
user_memory = {}

# create a function to get the memory for a user
def get_user_memory(user_id):
    if user_id not in user_memory:
        user_memory[user_id] = ConversationBufferMemory()
    return user_memory[user_id]

# create the LLMChain for a user
user_id = "example_user"
llm_chain = LLMChain(llm=my_llm, memory=get_user_memory(user_id), prompt=my_prompt)
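A hedged sketch of the redis idea mentioned above, assuming a langchain version that ships the Redis-backed chat history and a local redis instance; it swaps the in-process dict for per-user histories that survive restarts:

Python
# Per-user memory backed by redis instead of an in-process dict (sketch)
from langchain.memory import ConversationBufferMemory, RedisChatMessageHistory

def get_user_memory(user_id):
    history = RedisChatMessageHistory(
        session_id=user_id,                # one history per user id
        url="redis://localhost:6379/0",    # assumed local redis
    )
    return ConversationBufferMemory(chat_memory=history)

As for telling users apart, the id is usually whatever the web layer already has: a session cookie, an authenticated user id, or a uuid minted when a new conversation starts.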
But then again how would you distinguish ids and create them as you go?
I see what i can try
memory works but not multi user lol
it fks up the whole thing
gotta put more time into figuring that out now lol
Good thing is day by day things are being fixed and solutions are coming up for me
Slowly slowly i should achieve everything 🙂
Yea man! Every day another problem solved 💪
@Logan M bro I just set up text-generation-inference
I just built a submachine
Actually more like an smg
Holy shit tokens are flying
How is that thing so good
And why did i not use this before
It's like, super optimized haha
Bro its firinggg
Glad it works well!
Legit one command
Now i have a smg streaming tokens
Are the models it supports very smart though?
Wachu mean? U can provide ur own model
It legit sets up the whole thing as a predictor itself too so u can take the code and say llm = client()
Now u legit have a llm class none of that custom class bs
I can straight take that inside langchain
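For reference, the server in question is Hugging Face's text-generation-inference, launched with a single docker run of the ghcr.io/huggingface/text-generation-inference image, and "llm = client()" presumably refers to its Python client. A sketch of using the raw client and, if your langchain version includes it, the HuggingFaceTextGenInference wrapper; the URL and parameters are assumptions:

Python
# Talking to a running text-generation-inference server (sketch)
from text_generation import Client
from langchain.llms import HuggingFaceTextGenInference

# raw client against the local server
client = Client("http://127.0.0.1:8080")
print(client.generate("what is NLB?", max_new_tokens=200).generated_text)

# the same server exposed as a LangChain LLM, usable inside any chain
llm = HuggingFaceTextGenInference(
    inference_server_url="http://127.0.0.1:8080",
    max_new_tokens=200,
    temperature=0.1,
)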
This thing legit just completely outdid the whole month of stuff I have been fiddling with
Fuhh should have tried it when i asked u about it last time
I just randomly saw it again and I'm like let me see
This thing is damn easy to set up
oh what, I thought it only supported certain models
ohhh, certain architectures have special optimizations
its actually nuts
im trying to implement it into flask
i have to change the structure lol cause it needs json or else it throws this token error
text_generation.errors.ValidationError: Input validation error: inputs tokens + max_new_tokens must be <= 1512. Given: 48 inputs tokens and 1500 max_new_tokens
weird, I don't get why it's doing this
Plain Text
message = request.get_data()
#message = message.replace("message=", "")
def stream():
    for response in client.generate_stream("what is NLB?", temperature=0.1, max_new_tokens=1500):
        if not response.token.special:
            yield response.token.text
if i do this and hardcode the question it works fine
but when i use the message variable it gives that error
Plain Text
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/text_generation/client.py", line 251, in generate_stream
    response = StreamResponse(**json_payload)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for StreamResponse
token
  field required (type=value_error.missing)
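A sketch of what the route might look like once it reads a JSON body and trims max_new_tokens so longer prompts fit under the server limit diagnosed just below; the route name and payload shape are assumptions:

Python
# Flask route reading a JSON body and streaming tokens back (sketch)
from flask import Flask, Response, request
from text_generation import Client

app = Flask(__name__)
client = Client("http://127.0.0.1:8080")

@app.route("/chat", methods=["POST"])
def chat():
    message = request.get_json()["message"]  # JSON body instead of raw form data

    def stream():
        # 1024 instead of 1500 leaves room for the prompt under the server's total cap
        for response in client.generate_stream(message, temperature=0.1, max_new_tokens=1024):
            if not response.token.special:
                yield response.token.text

    return Response(stream(), mimetype="text/plain")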
Seems like somewhere there's a limit on the model of 1512 tokens. So the long message is causing an issue since max_new_tokens is so big: 48 input tokens + 1500 new tokens = 1548, which is over 1512 (try lowering it to 512 or something smaller)
🤷‍♂️ not sure on that one lol seems like a problem with the text-generation library?
this isn't the problem cause it's only happening with the message variable
if i hardcode it then it works fine?
wait, i wasn't reading the error properly
it needs to be 1512 or less as a whole, input tokens plus max_new_tokens
can't seem to find this limit anywhere
ahhh it's a hard limit inside the actual inference server code
oh wow, that's super low
that's annoying
damn that's bs
why would they do that
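Since the server enforces input tokens + max_new_tokens <= 1512 as a whole, one way to live with it is to budget max_new_tokens from the prompt length; a sketch, with gpt2 standing in for the served model's tokenizer. (Later releases of the server appear to expose launch flags for these limits, but whether that applied to this version is unclear.)

Python
# Budget max_new_tokens so prompt tokens + new tokens stays <= the server total (sketch)
from transformers import AutoTokenizer

MAX_TOTAL_TOKENS = 1512  # the limit reported in the error above
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # use the served model's tokenizer

def budget_new_tokens(prompt: str, cap: int = MAX_TOTAL_TOKENS) -> int:
    n_input = len(tokenizer(prompt)["input_ids"])
    return max(cap - n_input, 1)

# e.g. 48 input tokens leaves at most 1464 new tokens under a 1512 total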
check dm @Logan M
@Logan M this thing sits within llama index so nicely, but I'm still having that issue where it's doing multiple calls to the server
I don't get why it's making multiple calls when it gets the right answer. I'm no longer using the custom LLM class but the predictor-style class that's made for the inference server, it's sick
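A hedged sketch of how a text-generation-inference-backed LLM slotted into the llama_index of that era (0.6-style LLMPredictor/ServiceContext); the class names, URL, and data path are assumptions:

Python
# Wiring a text-generation-inference-backed LLM into llama_index (sketch, 0.6-era API)
from langchain.llms import HuggingFaceTextGenInference
from llama_index import LLMPredictor, ServiceContext, SimpleDirectoryReader, VectorStoreIndex

llm = HuggingFaceTextGenInference(
    inference_server_url="http://127.0.0.1:8080",
    max_new_tokens=512,
)
service_context = ServiceContext.from_defaults(llm_predictor=LLMPredictor(llm=llm))

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()
print(query_engine.query("what is NLB?"))

As for the repeated calls, llama_index's default response synthesis refines the answer with one LLM call per retrieved chunk, so several calls per query can be normal rather than a bug; that may be what is happening here.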
Memory works so well! Now time to mess around with tools with langchain
Maybe enable internet with it hehe