
Hey all, I'm trying to wrap my head around how to manage simultaneous user chats in LlamaIndex and keep their chat histories and contexts separate. Is it just as simple as instantiating a new chat_engine object every time there's a new chat from a different user? And how do you keep them apart in code? Just keep a list in memory of all the objects and call the chat function on the right object based on an ID or something? Anyone aware of any examples of this? It seems like a core use case for any sort of production setting that a single instance of a LlamaIndex program should be able to spin up and manage different chats at the same time and keep them separate.
13 comments
This might be related to what you're discussing, although I'm not a pro on the matter. This video is about OpenAI Assistants but discusses the same issue... https://youtu.be/0h1ry-SqINc?si=FARfIRVLRsJoh1CY&t=557
Also, slightly related: you can pass the chat history manually to each chat() call. And you can also easily get the current chat history back from the chat engine.

Throw this into redis or something, and off you go

chat_history is a list of ChatMessage objects, and each has a .json() and .dict() method (they are pydantic objects)

Plain Text
chat_history = chat_engine.chat_history

response = chat_engine.chat("Hello!", chat_history=chat_history)
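For example, here's a minimal sketch of what "throw this into Redis" could look like; the session_id, the key naming scheme, and the save/load helpers are made up for illustration, and chat_engine is assumed to already exist (e.g. from index.as_chat_engine()):

Python
# Sketch: persist each conversation's chat history in Redis, keyed by a session ID.
import json
import redis
from llama_index.llms import ChatMessage

r = redis.Redis()

def save_history(session_id: str, chat_history: list) -> None:
    # ChatMessage is a pydantic object, so .json() gives us a serializable string
    r.set(f"chat_history:{session_id}", json.dumps([m.json() for m in chat_history]))

def load_history(session_id: str) -> list:
    raw = r.get(f"chat_history:{session_id}")
    if raw is None:
        return []
    return [ChatMessage.parse_raw(m) for m in json.loads(raw)]

# Per-request flow: load the right history, chat, then save the updated history back
history = load_history("user-123")
response = chat_engine.chat("Hello!", chat_history=history)
save_history("user-123", chat_engine.chat_history)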
Thanks, glancing at it, it looks like a lot of the management of individual threads here happens at the OpenAI API level (i.e. referring to different threads with their own contexts, etc.). From what I can tell the OpenAI API is buried a bit in LlamaIndex, so I'm not sure that level of retrieving thread IDs / starting new threads via the OpenAI API is exposed. I'd wonder if it's cleaner to just manage the chat engine objects themselves in LI, since they also seem to maintain their own contexts/chat histories, etc.
Yeah, thanks @Logan M. So instead of managing multiple chat objects, you're just talking about keeping track of which chat histories belong to which conversations, storing those separately by conversation ID or whatever, and passing the right one in every time you call the engine.
Finding a lot of useful info just by reviewing https://github.com/run-llama/llama_index/blob/main/llama_index/chat_engine/context.py which I also think you referred to
Yup! I think you are on the right track

Reading source code is the best documentation tbh 😉
Thanks! So just to be clear, there really is nothing else aside from the prompts (system prompt etc.) and the chat history, plus the user's message; that is essentially the "entire" package being sent up to OpenAI. It's not really tracking any conversation on the OpenAI side; it's essentially stateless, and that package of state lives entirely in the LlamaIndex chat objects? I'm also trying to (like everyone else, I think) manage things "falling out of context", so this helps me think about other solutions. Is the chat history the entire history since the beginning, regardless of context window size, and it's just truncated to the latest ~16K tokens for GPT-3.5 Turbo? (My idea is generating a continuously updated summary that is fed in via the prompt with every message, if the context window size supports it, or even storing the chat history externally as another RAG document, though that would require re-indexing, I guess.)
Though... you could spin up another instance of LlamaIndex that just indexes the entire chat history at some interval, then generates a summary for the main model, with a length limit. 🙂
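As a rough sketch of that rolling-summary idea (the prompt wording, model choice, and 256-token cap below are arbitrary illustrations, not anything LlamaIndex prescribes):

Python
# Sketch: periodically condense the full chat history with a separate LLM call,
# then feed the resulting summary back in via the system prompt on later messages.
from llama_index.llms import OpenAI

summarizer = OpenAI(model="gpt-3.5-turbo", max_tokens=256)

def summarize_history(chat_history: list) -> str:
    # Flatten the ChatMessage list into a plain transcript
    transcript = "\n".join(f"{m.role}: {m.content}" for m in chat_history)
    return summarizer.complete(
        "Summarize this conversation so it can be used as context "
        f"for future messages:\n\n{transcript}"
    ).text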
Yea your understanding is correct!

The entire chat history is in a sliding window buffer. You can certainly extend the base class and design your own memory module though
https://github.com/run-llama/llama_index/blob/acd344104188cac7022f71b8887b8b38dae4ec19/llama_index/chat_engine/context.py#L76
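For instance, a minimal sketch of wiring up that sliding-window buffer per conversation, assuming the ChatMemoryBuffer module from llama_index.memory and the "context" chat mode; the token_limit and system_prompt values are placeholders:

Python
# Sketch: give each conversation its own memory buffer so histories stay separate.
from llama_index.memory import ChatMemoryBuffer

memory = ChatMemoryBuffer.from_defaults(token_limit=3000)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt="You are a helpful assistant.",
)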
Not sure if there are best practices here; I know this is its own whole field of effort (e.g. MemGPT).
(We only have this single memory module, so if you make your own, I highly encourage a PR ❤️)

Yea there's quite a few strats for memory tbh. I think sliding window + retrieval makes some sense (long vs short term)
Just haven't implemented it yet
I'll definitely explore it, and if I stumble on a strategy I'll definitely share. Oh... the other thing is sometimes I find LlamaIndex seems to stop referencing the documents as frequently, and I've tried re-loading the index from the saved embeddings. Is that all just placebo, or does it actually do anything? Anecdotally it seems to clear the context history, which I guess makes sense, since I'm instantiating a new chat_engine, which would wipe the history. I guess I still don't fully get how RAG works the documents into the context sent to OpenAI. I assume it's more than just chat history + prompt, since retrieval is happening as well, yeah? Perhaps it's just an artifact of the chat history drifting such that it overpowers the RAG results?