Oof, that's a rough one to debug
The error is somewhere in this block. I don't really know where token is coming from... nothing uses/mentions a variable by that name
is_function(), put_in_queue(), and memory.put() are also all very simple 1-2 line functions (and also no mention of token)
I'm not 100% sure how privateGPT implements SageMaker LLMs, but it might be related to that? Something to do with how the LLM streams is my guess
Did some digging, updated the github issue with the likely source
I've answered on the issue with more details based on your help
hmmm, is the input to messages_to_prompt() empty as well?
You can test that the messages_to_prompt() function is working by trying:
from llama_index.llms import ChatMessage
print(llm.messages_to_prompt([ChatMessage(role="user", content="Test")]))
@Logan M actually messages_to_prompt seems to be a class attribute declared as a pydantic Field
I'll try your code to see if the output is empty.
If it is empty, do you have any idea of the source?
If it is empty, it feels like a bug in llama-index or the LLM implementation in privateGPT
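(Side note on the pydantic Field point: on 0.9.x-era llama-index LLMs, messages_to_prompt is just a field that stores a callable, so you can call it standalone. A minimal sketch, with a made-up formatter standing in for whatever privateGPT actually passes:)
from llama_index.llms import ChatMessage

# Made-up formatter standing in for whatever privateGPT wires into the
# messages_to_prompt field; llama-index stores it as a plain callable.
def toy_messages_to_prompt(messages):
    return "\n".join(f"{m.role.value}: {m.content or ''}" for m in messages) + "\nassistant: "

print(toy_messages_to_prompt([ChatMessage(role="user", content="Test")]))
# For a constructed llm object, print(llm.messages_to_prompt) shows which callable ended up in the field.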
Just checked, and I'm fairly confident it's not an issue in llama-index, at least not in the latest version
It seems that you're right: when I run your code I can see (what looks like) a correct output.
I'll try to dig a bit on the LLM side, but if you have any clue, that would be great!
(in any case, you've already helped a lot, thanks!)
FYI: I think I've found the real issue: it seems that the user question is simply not passed to the LLM, and we only pass the system message (i.e. the RAG context). That's why the LLM claims an empty input
definitely an issue on the privateGPT side
@Logan M I've dug more and I think I may have found the cause. I've been going through the llama_index code used by privateGPT, in particular llama_index/chat_engine/context.py.
If I understand correctly, this is where the RAG information is retrieved and added to my context before my message, in the stream_chat method:
https://github.com/run-llama/llama_index/blob/e7090975a1807b6c30c132c65464cb51dba3804a/llama_index/chat_engine/context.py#L181
However, I have quite a long context here. My context is correctly retrieved, but after that my message can't be retrieved by self._memory.get because the initial_token_count seems to be greater than the token_limit of the memory.
(I've tried to print out all the important details in the attachments)
So the result is that the message passed to the LLM contains only the context and not my question anymore
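For anyone following along, here is a self-contained sketch of that flow (paraphrased from the linked stream_chat method, not the exact llama-index source; the message contents are placeholders):
from llama_index.llms import ChatMessage
from llama_index.memory import ChatMemoryBuffer

# The memory buffer holds the chat history, including the latest user question.
memory = ChatMemoryBuffer.from_defaults(token_limit=3900)
memory.put(ChatMessage(role="user", content="What does the report say about X?"))

# The retrieved RAG context becomes a system-style prefix; its token count is
# handed to the buffer as initial_token_count.
prefix_messages = [ChatMessage(role="system", content="<long retrieved context here>")]
initial_token_count = len(
    memory.tokenizer_fn(" ".join(m.content or "" for m in prefix_messages))
)

# If initial_token_count eats most of token_limit, no history fits anymore and
# the user question is missing from all_messages, which is exactly the symptom.
all_messages = prefix_messages + memory.get(initial_token_count=initial_token_count)
print(all_messages)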
If so, are there any options to increase this token_limit? Or any other solution that could be used?
I may be missing something; I'm not an expert in LLMs, and even less in web UIs
Ah that makes a lot of sense!
You can definitely increase the token limit; however, there is a cap on how far you can push it (it still has to fit within the LLM's context window)
from llama_index.memory import ChatMemoryBuffer
memory = ChatMemoryBuffer.from_defaults(token_limit=3900)
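If it helps, a hedged sketch of where that bigger buffer plugs in (retriever is a placeholder for whatever privateGPT already builds, and the actual wiring there may differ):
from llama_index.chat_engine import ContextChatEngine
from llama_index.memory import ChatMemoryBuffer

# `retriever` stands in for the retriever privateGPT already constructs;
# in a real setup you'd also pass its service_context / LLM here.
memory = ChatMemoryBuffer.from_defaults(token_limit=3900)
chat_engine = ContextChatEngine.from_defaults(
    retriever=retriever,
    memory=memory,
    system_prompt="Answer using only the provided context.",
)
streaming_response = chat_engine.stream_chat("What does the document say about X?")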
Otherwise, you need to reduce how much context is being retrieved
Do you have any guidance on how to reduce the context retrieved?
(maybe some documentation or similar)
hmmm, I would need to dive into what privateGPT is doing
Generally, the idea would be to decrease the top k, the chunk size, or both
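A rough illustration of those two knobs on the llama-index side (documents and the rest are placeholders; privateGPT layers its own config on top of this, so treat it as a sketch):
from llama_index import ServiceContext, VectorStoreIndex

# Smaller chunks at indexing time -> each retrieved node carries fewer tokens.
# `documents` is a placeholder for whatever privateGPT ingests.
service_context = ServiceContext.from_defaults(chunk_size=512)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Lower top k at query time -> fewer chunks get stuffed into the context.
retriever = index.as_retriever(similarity_top_k=2)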
Ok, thanks
For now I'll stick with the increased memory limit, and I'll answer on my GitHub issue with all the findings we have so far.
Thank you very much for your help, really appreciated!