Hey everyone, just starting my LLM journey trying to get my own chatbot running with llama-2-chat, and I had a few general and LlamaIndex-specific questions I was hoping someone could help me with:

1) This probably seems like a naive question, but it's never made explicit anywhere: does the system prompt itself reduce the number of tokens available? I.e. if I instantiate my model with 2048 tokens and my system prompt is 48 tokens, I only have 2000 tokens left to work with, right?

2) What is the best way of passing a custom prompt to Llama 2 via llama-cpp-python (and the new associated LlamaIndex wrapper)? I'm unclear whether system tokens should or should not be appearing in my output with the new default prompts.

3) What is a good way of dealing with conversation history / large knowledge bases? A conversational chatbot should ideally keep track of things previously said. Is this just as simple as adding a running history of inputs and outputs to the prompt? Won't this eventually blow out my token limit? Similarly, when trying to summarise or aggregate across a large knowledge base, how can I get my model to summarise a 20-page PDF if providing that information goes beyond the available token size?
Yea, it will essentially subtract like you describe: the system prompt counts against the context window, so with a 2048-token window and a 48-token system prompt you have roughly 2000 tokens left for the rest of the prompt plus the generated output.
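If you want to sanity-check the budget yourself, here's a minimal sketch using llama-cpp-python's tokenizer. The model path, context size, and system prompt are placeholders; the exact token count depends on the model's tokenizer.

```python
from llama_cpp import Llama

context_window = 2048

# Load the model with a 2048-token context window (path is a placeholder)
llm = Llama(model_path="./llama-2-7b-chat.gguf", n_ctx=context_window)

system_prompt = "You are a helpful assistant that answers concisely."

# Tokenize the system prompt to see how much of the window it consumes
system_tokens = llm.tokenize(system_prompt.encode("utf-8"))
remaining = context_window - len(system_tokens)

print(f"System prompt uses {len(system_tokens)} tokens, leaving {remaining} for everything else")
```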
You can use something like this. We actually have specific utils for llama2-chat formatting because it's rather complicated. If you use another fine-tuned variant or any other model, you can pass in your own custom function to re-format the text before it's passed to the LLM.
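A sketch of what that looks like with the LlamaIndex LlamaCPP wrapper. The import paths match the llama-index API around the time of this thread and may differ in newer releases; the model paths and generation settings are placeholders, and `my_completion_to_prompt` is a hypothetical formatter for an alpaca-style fine-tune.

```python
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

# llama-2-chat: use the built-in utils, which wrap the text in the
# [INST] / <<SYS>> template the chat model expects
llm = LlamaCPP(
    model_path="./llama-2-13b-chat.gguf",  # placeholder path
    temperature=0.1,
    max_new_tokens=256,
    context_window=2048,
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

# Any other fine-tune: pass your own formatting function instead
def my_completion_to_prompt(completion: str) -> str:
    # hypothetical alpaca-style template
    return f"### Instruction:\n{completion}\n\n### Response:\n"

other_llm = LlamaCPP(
    model_path="./some-other-model.gguf",  # placeholder path
    completion_to_prompt=my_completion_to_prompt,
)
```

Note these functions only reformat the input before it's sent to the model; the special tokens they add are part of the prompt, not something that should be echoed back in the completion.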