Logging

Feels like this is very basic, but I can't quite get it to work - how can I see exactly the prompt that is actually sent to the LLM - that is, the exact system message and user message with all the docs filled in, etc.? Something is going wrong and I need to debug it, and I want to start by just verifying the exact string sent to the LLM.
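For reference, a minimal sketch of one way to surface this, assuming a recent llama-index version: standard Python logging at DEBUG level (which dumps library/HTTP logs) plus the built-in "simple" global handler, which prints each LLM input and output as it happens.

```python
import logging
import sys

import llama_index.core

# Send debug-level library logs (HTTP requests, etc.) to stdout.
logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)

# The "simple" observability handler prints each prompt sent to the LLM
# and the raw completion that comes back.
llama_index.core.set_global_handler("simple")
```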
Thanks! This is great! And it showed me why something like "Tell me a joke" made so many LLM calls... the chat engine is using chain-of-thought to break this down across only a single tool - my retriever? Very strange default behavior!

Side note - is there a DEBUG option to just dump the entire JSON object sent to the LLM? I'm using an OpenAILike LLM - so ultimately there is a JSON list of dicts sent to the client here:
[Attachment: image.png]
Would be nice to just dump message_dicts out
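One way to capture those payloads inside llama-index is the LlamaDebugHandler callback - a rough sketch below (the exact payload shape can vary by version). Since OpenAILike ultimately goes through the official openai client, setting the OPENAI_LOG=debug environment variable is another way to see the raw request body.

```python
from llama_index.core import Settings
from llama_index.core.callbacks import CallbackManager, CBEventType, LlamaDebugHandler

# Record every LLM call made through llama-index (messages in, response out).
llama_debug = LlamaDebugHandler(print_trace_on_end=True)
Settings.callback_manager = CallbackManager([llama_debug])

# ... run the chat engine / query here ...

# Dump the captured payload for each LLM event.
for start_event, end_event in llama_debug.get_event_pairs(CBEventType.LLM):
    print(start_event.payload)  # contains the messages/prompt handed to the LLM
```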
Very curious what kind of setup you had that would do that (a ReAct agent and an open-source LLM maybe?)
I have an open-source LLM hosted on vLLM in OpenAI-compat mode - so I'm using OpenAILike. I was experimenting with the chat engine and somehow the ReAct one caused that very strange behavior.
Couple this with an earlier "issue" where the embedding vector was being included as metadata sent to the LLM, and it made for some very strange behavior indeed!
@Logan M - in general I'm having a bit of a hard time getting llama-index to work with our existing "microservices" arch - I already have an OpenSearch index in place, and ingestion is handled elsewhere (may consider moving it into llama-index in the future). I already have an embedding service - I had to write a custom embedder that makes HTTP calls to this service to get embeddings. And I already have a hosted LLM as described above.
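For anyone in a similar spot, a rough sketch of what such a custom embedder can look like, subclassing BaseEmbedding - the endpoint URL and request/response shape here are placeholders for whatever your embedding service actually exposes:

```python
from typing import List

import requests
from llama_index.core.embeddings import BaseEmbedding


class RemoteEmbedding(BaseEmbedding):
    """Embedding model that calls an external HTTP embedding service."""

    endpoint: str = "http://embeddings.internal/embed"  # hypothetical URL

    def _embed(self, text: str) -> List[float]:
        # Payload/response format is a stand-in for your service's real API.
        resp = requests.post(self.endpoint, json={"text": text}, timeout=30)
        resp.raise_for_status()
        return resp.json()["embedding"]

    def _get_query_embedding(self, query: str) -> List[float]:
        return self._embed(query)

    def _get_text_embedding(self, text: str) -> List[float]:
        return self._embed(text)

    async def _aget_query_embedding(self, query: str) -> List[float]:
        return self._get_query_embedding(query)
```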
@Logan M - I already said this in general, but for more context on the above issue:

I built a custom RAG chat app which takes a query, compresses it in the context of a chat history, and then decides if RAG is needed. If it is, it performs the retrieval and gives the retrieved chunks, chat history, and current message to a final LLM. If it isn't, it skips retrieval and just passes the chat history and current message to the LLM.

As far as I can tell, the Chat Engine - Condense Plus Context mode is the most similar to what I've built - except it RAGs every time. Since my docs are very "focused", I really won't need to RAG every time, especially for things like "what kinds of things can you do?" and "who are you?", which I currently handle with some identity information in the system prompt.

So I need a kind of Conditional-Condense-Plus-Context chat mode.

Perhaps something like "best" or ReAct could work, but in our setting we need to reduce the time to first token the user sees, so we want to avoid extraneous LLM calls, and that seems hard with agents.
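A possible shape for that "decide if RAG is needed" step is a single cheap routing call - a sketch, where the prompt wording and the needs_retrieval helper are illustrative rather than anything from the thread:

```python
from llama_index.core.llms import ChatMessage

ROUTE_PROMPT = (
    "Given the latest user message, answer with exactly RETRIEVE if external "
    "documents are needed to answer it, or CHAT if it can be answered from the "
    "conversation and the assistant's identity alone.\n\nLatest message: {message}"
)


def needs_retrieval(llm, message: str) -> bool:
    # One small, fast LLM call to route; the CHAT path then skips retrieval
    # entirely and goes straight to the final response.
    decision = llm.chat(
        [ChatMessage(role="user", content=ROUTE_PROMPT.format(message=message))]
    )
    return "RETRIEVE" in decision.message.content.upper()
```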
ah, handling ingestion outside of llama-index kind of explains the weird behaviour with the nodes including the embedding vectors πŸ˜…
You could just build your own chat loop tbh (it sounds like you mostly have already?)

Performing a retrieve every time is pretty cheap in terms of time -- removing the call to decide whether to retrieve or not would speed things up quite a bit. But I'm guessing including retrieved context, even when it's not technically needed, might be confusing your LLM?
In terms of building your own chat loop, it's pretty straightforward tbh (see the sketch after this list):

  • get the current chat history from a ChatMemoryBuffer
  • add the new user message to the chat history list
  • perform operations (decide to retrieve, query rewrite, etc.)
  • if retrieving, stuff the context into the chat history (either a system prompt or in the latest user message)
  • make the final LLM call
  • add the user's original message + the LLM response to the chat memory buffer
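A minimal sketch of that loop, assuming an already-constructed llm and retriever, a SYSTEM_PROMPT of your own, and a needs_retrieval router like the one sketched earlier (all of these names are placeholders):

```python
from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer

SYSTEM_PROMPT = "You are a helpful assistant for ..."  # identity info lives here

memory = ChatMemoryBuffer.from_defaults(token_limit=4000)


def chat_turn(llm, retriever, user_msg: str) -> str:
    # 1. get the current chat history from the memory buffer
    messages = [ChatMessage(role="system", content=SYSTEM_PROMPT)]
    messages += memory.get()

    # 2-4. decide whether to retrieve; if so, stuff the context into the
    #      latest user message, otherwise pass the message through as-is
    if needs_retrieval(llm, user_msg):
        nodes = retriever.retrieve(user_msg)
        context = "\n\n".join(n.get_content() for n in nodes)
        messages.append(
            ChatMessage(role="user", content=f"Context:\n{context}\n\nQuestion: {user_msg}")
        )
    else:
        messages.append(ChatMessage(role="user", content=user_msg))

    # 5. final LLM call
    response = llm.chat(messages)

    # 6. store the user's original message + the LLM reply in memory
    memory.put(ChatMessage(role="user", content=user_msg))
    memory.put(response.message)
    return response.message.content
```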
I did something similar with our query pipeline syntax
https://docs.llamaindex.ai/en/stable/examples/pipeline/query_pipeline_memory/?h=query+pipeline

You could of course just write the loop imperatively as well.
Re: "But I'm guessing including retrieved context, even when its not technically needed, might be confusing your LLM?"

yeah, for things like "Hi how are you - what can you do" etc., answers were way off - much better to have no docs and just let the base LLM answer, with 1 or 2 sentences about "identity" in the system prompt.
I was trying to build my own chat loop - I spent a while on the query pipeline but couldn't figure out the system prompt piece, so I'm turning to doing it fully imperatively.

I built it all manually in just Python - but I'm looking to re-create it in an established library for easier "maintenance" and experimentation - i.e., switching agent types, etc.

Mostly, it came time to do observability/instrumentation, and instead of adding it manually to my from-scratch implementation, I can use the callbacks in llama-index (or langchain) and get it "for free" (mostly) - that was the big impetus for me - observability/instrumentation.
Re:

get the current chat history from a ChatMemoryBuffer
add the new user message to the chat history list
perform operations (decide to retrieve, query rewrite, etc.)
if retrieving, stuff the context into the chat history (either a system prompt or in the latest user message)
make the final LLM call
add the user's original message + the LLM response to the chat memory buffer

Exactly what I do in just pure python lol
Happy to help you figure out the system prompt piece -- shouldn't be so bad πŸ™‚

And yes, the new instrumentation stuff should be pretty valuable tbh -- we just made the instrumentation thread-safe and coroutine-safe too. I'm sure you saw, but you can also create your own events/spans, which might be nice for the more imperative loop I was describing.
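A small sketch of what custom events and spans can look like with the instrumentation module (class and function names here are illustrative; check the instrumentation docs for the exact base classes in your version):

```python
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation.events.base import BaseEvent

dispatcher = get_dispatcher(__name__)


class RetrievalDecisionEvent(BaseEvent):
    """Custom event recording whether the chat loop decided to retrieve."""

    retrieve: bool

    @classmethod
    def class_name(cls) -> str:
        return "RetrievalDecisionEvent"


class PrintEventHandler(BaseEventHandler):
    """Tiny handler that prints every event it receives."""

    @classmethod
    def class_name(cls) -> str:
        return "PrintEventHandler"

    def handle(self, event: BaseEvent, **kwargs) -> None:
        print(event.class_name(), event.timestamp)


# Attach the handler to the root dispatcher so it also sees library events.
get_dispatcher().add_event_handler(PrintEventHandler())


@dispatcher.span  # records a span around the whole call
def answer(user_msg: str) -> str:
    do_retrieve = len(user_msg) > 40  # stand-in for a real routing decision
    dispatcher.event(RetrievalDecisionEvent(retrieve=do_retrieve))
    return "..."  # the actual LLM call would go here
```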