The community member is starting their journey with LLMs and has questions about using llama-2-chat and llamaindex. The main points are:
1) The system prompt counts against the model's context window, so if the model is instantiated with a 2048-token window and the system prompt is 48 tokens, the member would have 2000 tokens left to work with.
2) The community member is unclear on the best way to pass a custom prompt to llama 2 via llama-cpp-python and the llamaindex wrapper, and whether system tokens should appear in the output.
3) The community member is asking about the best way to handle historical memory and large knowledge bases in a conversational chatbot, such as adding a running history of inputs and outputs to the prompt, and how to summarize large documents without exceeding the token limit.
In the comments, a community member provides some guidance on using specific utils for llama2-chat formatting, and mentions that the chat engines/agents use a basic memory buffer with a defined token limit, where the history is cut off if it exceeds the limit.
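For reference, the memory buffer mentioned above can be wired up roughly like this. This is a minimal sketch assuming a llama_index release around 0.8.x (import paths change between versions), a local "./data" folder of documents, and placeholder values for the token limit and system prompt:

```python
# Sketch only - import paths match llama_index ~0.8.x and may differ in newer releases.
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.memory import ChatMemoryBuffer

# Build an index over your documents ("./data" is a placeholder folder).
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# History is kept in a token-limited buffer; the oldest turns are dropped once the limit is hit.
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt="You are a helpful assistant answering questions about my documents.",
)

response = chat_engine.chat("What did we discuss earlier about token limits?")
print(response)
```

Because the buffer truncates the oldest messages once it passes its token limit, the combined system prompt, chat history, and retrieved context stay inside the model's window rather than growing without bound.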
Hey everyone, just starting my LLM journey in trying to get my own chatbot running with llama-2-chat and had a few general and llamaindex-specific questions I was hoping someone could help me with:
1) This probably seems like a naive question but it's never explicit anywhere - does the system prompt itself reduce the amount of tokens available? I.e. if I instantiate my model with 2048 tokens and my system prompt is 48 tokens, I only have 2000 tokens left to work with, right?
2) What is the best way of passing a custom prompt to llama 2 via llama-cpp-python (and the new associated llamaindex wrapper)? I'm unclear if system tokens should or should not be appearing in my output with the new default prompts.
3) What exactly is a good way of dealing with historical memory / large knowledge bases? In a conversational chatbot it should ideally be able to keep track of things previously said - is this just as simple as adding a running history of inputs and outputs as part of the prompt? Won't that eventually blow out my token limit? Similarly, when trying to summarise or aggregate across large knowledge bases, how can I get my model to summarise a 20-page PDF if providing that information goes beyond the available token size?
Yea, it will essentially subtract like you describe
You can use something like this. We actually have specific utils for llama2-chat formatting because it's rather complicated. If you use another fine-tuned variant or any other model, you can pass in your own custom function to re-format the text before it's passed to the LLM.
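As a rough illustration, assuming a llama_index release around 0.8.x where the llama2-chat helpers live in llama_index.llms.llama_utils (paths may differ in newer versions) and a placeholder model path:

```python
# Sketch only - import paths match llama_index ~0.8.x; the model path is a placeholder.
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

llm = LlamaCPP(
    model_path="./models/llama-2-13b-chat.Q4_0.gguf",  # placeholder path to your local weights
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,  # leave some headroom below the model's full window
    # These helpers wrap prompts in the [INST] / <<SYS>> format llama-2-chat expects.
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

print(llm.complete("Hello, who are you?"))


# For a differently formatted fine-tune, pass your own formatting function instead
# (this Alpaca-style template is just an example):
def my_completion_to_prompt(completion: str) -> str:
    return f"### Instruction:\n{completion}\n\n### Response:\n"
```

Since the [INST] / <<SYS>> wrapping happens inside the prompt formatting step, those special tokens shouldn't be showing up in your generated output.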