The community member is starting their journey with LLMs and has questions about using llama-2-chat and llamaindex. The main points are:
1) The system prompt counts against the model's context window, so if the model is instantiated with a 2048-token window and the system prompt is 48 tokens, the member would have 2000 tokens left to work with.
2) The community member is unclear on the best way to pass a custom prompt to llama 2 via llama-cpp-python and the llamaindex wrapper, and whether system tokens should appear in the output.
3) The community member is asking about the best way to handle historical memory and large knowledge bases in a conversational chatbot, such as adding a running history of inputs and outputs to the prompt, and how to summarize large documents without exceeding the token limit.
In the comments, a community member provides some guidance on using specific utils for llama2-chat formatting, and mentions that the chat engines/agents use a basic memory buffer with a defined token limit, where the history is cut off if it exceeds the limit.
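For reference, the memory buffer mentioned above can be wired up roughly like this. This is a minimal sketch assuming a llama_index release around 0.8.x (import paths change between versions), a local "./data" folder of documents, and placeholder values for the token limit and system prompt:

```python
# Sketch only - import paths match llama_index ~0.8.x and may differ in newer releases.
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index.memory import ChatMemoryBuffer

# Build an index over your documents ("./data" is a placeholder folder).
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)

# History is kept in a token-limited buffer; the oldest turns are dropped once the limit is hit.
memory = ChatMemoryBuffer.from_defaults(token_limit=1500)

chat_engine = index.as_chat_engine(
    chat_mode="context",
    memory=memory,
    system_prompt="You are a helpful assistant answering questions about my documents.",
)

response = chat_engine.chat("What did we discuss earlier about token limits?")
print(response)
```

Because the buffer truncates the oldest messages once it passes its token limit, the combined system prompt, chat history, and retrieved context stay inside the model's window rather than growing without bound.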
Hey everyone, just starting my LLM journey in trying to get my own chatbot running with llama-2-chat and had a few general and llamaindex-specific questions I was hoping someone could help me with:
1) This probably seems like a naive question but it's never explicit anywhere - does the system prompt itself reduce the amount of tokens available? I.e. if I instantiate my model with 2048 tokens and my system prompt is 48 tokens, I only have 2000 tokens left to work with, right?
2) What is the best way of passing a custom prompt to llama 2 via llama-cpp-python (and the new associated llamaindex wrapper)? I'm unclear if system tokens should or should not be appearing in my output with the new default prompts.
3) What exactly is a good way of dealing with historical memory / large knowledge bases? In a conversational chatbot it should ideally be able to keep track of things previously said - is this just as simple as adding a running history of inputs and outputs as part of the prompt? Won't that eventually blow out my token limit? Similarly, when trying to summarise or aggregate across large knowledge bases, how can I get my model to summarise a 20-page PDF if providing that information goes beyond the available token size?
Yea, it will essentially subtract like you describe
You can use something like this. We actually have specific utils for llama2-chat formatting because it's rather complicated. If you use another fine-tuned variant or any other model, you can pass in your own custom function to re-format the text before it's passed to the LLM.
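As a rough illustration, assuming a llama_index release around 0.8.x where the llama2-chat helpers live in llama_index.llms.llama_utils (paths may differ in newer versions) and a placeholder model path:

```python
# Sketch only - import paths match llama_index ~0.8.x; the model path is a placeholder.
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

llm = LlamaCPP(
    model_path="./models/llama-2-13b-chat.Q4_0.gguf",  # placeholder path to your local weights
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,  # leave some headroom below the model's full window
    # These helpers wrap prompts in the [INST] / <<SYS>> format llama-2-chat expects.
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

print(llm.complete("Hello, who are you?"))


# For a differently formatted fine-tune, pass your own formatting function instead
# (this Alpaca-style template is just an example):
def my_completion_to_prompt(completion: str) -> str:
    return f"### Instruction:\n{completion}\n\n### Response:\n"
```

Since the [INST] / <<SYS>> wrapping happens inside the prompt formatting step, those special tokens shouldn't be showing up in your generated output.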