Is there a way to set it to maximum?

At a glance
The post asks if there is a way to set the output to the maximum, meaning to generate output until it reaches the end of the token limit or finishes answering. The comments discuss how to set the max_tokens parameter in the language model definition to limit the maximum output tokens, and how num_output is used to ensure there is room for the output tokens. Community members discuss potential issues with the prompt size being too large, and suggest reducing the chunk_size_limit in the ServiceContext object to address this. They also explore potential language-specific differences in token usage. Overall, the discussion focuses on optimizing the output length and handling large prompts.
Is there a way to set it to maximum? Meaning, output until it reaches the end of the token limit or finishes answering?
Kind of!

So setting max_tokens in the LLM definition sets the maximum number of tokens the model will generate.

However, it's important to note that it can't generate past its context size (4096 tokens), which includes the prompt, the context text, and everything else in the original input.

This is why num_output is needed: it ensures that every message sent to the LLM leaves room for num_output output tokens.
@Logan M num_output is used to limit our prompt to a maximum of 3,584 (4096 - 512) tokens, right? So to increase the room for output tokens I need to increase both num_output in the prompt_helper and max_tokens in the LLM object, right?
You got it! 👍
I realize this is bad UX though, it should be improved at some point 🙂
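In code, that pairing might look something like the sketch below (the 1024 value is illustrative, and the imports follow the gpt_index API used elsewhere in this thread):

Plain Text
# Sketch: keep max_tokens (LLM side) and num_output (PromptHelper side) in sync
from langchain.chat_models import ChatOpenAI
from gpt_index import LLMPredictor, PromptHelper, ServiceContext

num_output = 1024  # illustrative value

# the model will stop generating after this many tokens
llm = ChatOpenAI(temperature=0, max_tokens=num_output)

# the prompt helper reserves the same amount of the 4096-token context for the answer
prompt_helper = PromptHelper(max_input_size=4096, num_output=num_output, max_chunk_overlap=20)

service_context = ServiceContext.from_defaults(
    llm_predictor=LLMPredictor(llm=llm),
    prompt_helper=prompt_helper,
)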
I'm not sure why the prompt is that big tho :/

From the logs I can see that a huge amount has been sent to the LLM:

Plain Text
INFO:gpt_index.token_counter.token_counter:> [query] Total LLM token usage: 3620 tokens
So the rest of the space is coming from your document chunks as context to answer the queries.

You can specify something like chunk_size_limit=512 in the ServiceContext object. The default is quite large.
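For example, a sketch of that setting (llm_predictor and documents are assumed to come from your existing pipeline; 512 is just a starting point):

Plain Text
# Sketch: smaller chunks leave more of the 4096-token budget for the generated answer
from gpt_index import GPTSimpleVectorIndex, ServiceContext

service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,   # whatever predictor you already use
    chunk_size_limit=512,          # the default is much larger
)
# depending on your gpt_index version this may be GPTSimpleVectorIndex(documents, ...)
index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)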
@Logan M If I decrease it, will it just cut off the previous chunks, or will it somehow try to find an important chunk inside a chunk?

e.g.:

with default prompt size:

ABCDEFG

after I decrease the default prompt size:

ABCD

or

ACD
It will cut it off, but it will use some overlap (about 20 chars).

Overall, the LLM usually deals with it just fine.
ok. let me try. Thanks @Logan M 👍
Sorry, where can I find the default chunk size? There is again None in the from_defaults function that I'm using:

Plain Text
chunk_size_limit: Optional[int] = None
https://github.com/jerryjliu/llama_index/blob/main/gpt_index/langchain_helpers/text_splitter.py#L29

But it can be reduced on the fly inside the PromptHelper if the user query is long.
@Logan M Thanks. Do I need to reduce it both in prompt_helper and in service_context?
Just in the service context should be good I think
@Logan M It still cuts the answer off... 😥 I increased the output token size and decreased the chunk size drastically, but it still gets cut off. Could you help me figure out what I'm doing wrong?

Plain Text
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=20)
prompt_helper = PromptHelper(max_input_size=512, num_output=2048, max_chunk_overlap=20, chunk_size_limit=512)
llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, max_tokens=2048)
llm_predictor = LLMPredictor(llm=llm)
service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper, chunk_size_limit=512)


By the way, my pipeline is from the chatbot tutorial:
https://gpt-index.readthedocs.io/en/latest/guides/tutorials/building_a_chatbot.html
I would understand if either my prompt or the answer were long, but they are very small:

prompt:

Plain Text
какой порядок приема в «Международный университет туризма и гостеприимства»
(Translation: "what is the admission procedure at the International University of Tourism and Hospitality")


answer:

Plain Text
Порядок приема пакета документов поступающего в «Международный университет туризма и гостеприимства» определяется Приемной комиссией. Поступающий должен представить в Приемную комиссию оригиналы документов. Далее поступающему предос
(Translation: "The procedure for accepting an applicant's document package at the International University of Tourism and Hospitality is determined by the Admissions Committee. The applicant must submit the original documents to the Admissions Committee. Next, the applicant is prov..."; the answer is cut off mid-word.)


Plain Text
INFO:gpt_index.token_counter.token_counter:> [query] Total LLM token usage: 2832 tokens


Or is it a complete answer (according to the LLM)?
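As an aside on the configuration snippet above: max_input_size in PromptHelper is normally the model's full context window (4096 for gpt-3.5-turbo), not the chunk size, so a setup closer to the following sketch may be what was intended (values are illustrative and mirror the snippet above):

Plain Text
# Sketch: max_input_size is the model's context window, not the chunk size
prompt_helper = PromptHelper(
    max_input_size=4096,   # gpt-3.5-turbo context window
    num_output=1024,       # room reserved for the generated answer
    max_chunk_overlap=20,
    chunk_size_limit=512,
)
llm = ChatOpenAI(temperature=0, openai_api_key=OPENAI_API_KEY, max_tokens=1024)
service_context = ServiceContext.from_defaults(
    llm_predictor=LLMPredictor(llm=llm),
    prompt_helper=prompt_helper,
    chunk_size_limit=512,
)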
What kind of index are you using? If you are using a List or Tree index, the query will check more than one node (so more tokens than chunk size)

If you are using a vector index, it will check similarity_top_k nodes (default is 1)
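For reference, a sketch of what that looks like at query time (assuming index is a GPTSimpleVectorIndex; the query string and top-k value are placeholders):

Plain Text
# Sketch: each retrieved node adds roughly chunk_size tokens of context to the prompt
response = index.query(
    "your question here",  # placeholder query
    similarity_top_k=1,    # default for a vector index
)
print(response)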
@Logan M So to index the documents (I have 3), I'm using GPTSimpleVectorIndex.

Then, at chat time, I'm using several, I guess:

GPTSimpleVectorIndex
ComposableGraph.from_indices <--- GPTListIndex

It's the same code as in the tutorial:

https://gpt-index.readthedocs.io/en/latest/guides/tutorials/building_a_chatbot.html
So you have 3 vector indices, each with a top_k of 1, wrapped by a list index.

So 3 nodes are retrieved for each query and sent to the LLM
@Logan M Yeah. So I have 2 thoughts:

  1. It could be a language issue, because I tried the same code with English texts and the outputs are way bigger than the Russian ones. Here is an example output:
Plain Text
1. Total installed capacity of the Group 2. Total assets and profits of the Group 3. Gearing ratio of the Group 4. Details of the Company's subsidiaries and associates 5. Details of the Company's joint ventures 6. Discussion and analysis of the Group's business review, performance, key factors of its results and financial performance 7. Risk factors and risk management of the Group 8. Prospect for future development of the Group 9. Critical accounting estimates used in the preparation of the financial statements 10. Areas involving a higher degree of judgement or complexity in the application of accounting policies 11. Assumptions and estimates that are significant to the consolidated financial statements 12. Details of the changes in property, plant and equipment of the Group 13. Details of the changes in share capital of the Company during the Year 14. Provisions for pre-emptive rights under the Articles of Association or the PRC laws 15. Annual electricity generation of the Group 16. Savings of standard coal and reduction in carbon dioxide emissions of the Group 17. Clean Development Mechanism of the Group 18. Environmental policies of the Group and compliance with relevant laws and regulations 19. Relationships with key stakeholders of the Group


  2. The output that was generated by the LLM is complete (according to the LLM), even though it is not complete according to a human.
So if the 2nd one is correct, then I probably need to change the prompt.
So, words get broken into tokens.

On average, for English, 100 tokens ~= 75 words.

I'm guessing other languages have different ratios 🤔

For example, you can test on this app: https://platform.openai.com/tokenizer

The English output above is 236 tokens.

The cut-off response you shared in Cyrillic earlier is 254 tokens.
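To check counts like these locally, the same kind of tokenizer can be run with the tiktoken package (a sketch; cl100k_base is an assumption about the encoding used by gpt-3.5-turbo):

Plain Text
import tiktoken

# cl100k_base is the encoding used by gpt-3.5-turbo
enc = tiktoken.get_encoding("cl100k_base")

english_text = "Total installed capacity of the Group ..."
cyrillic_text = "Порядок приема пакета документов поступающего ..."

print(len(enc.encode(english_text)))   # token count for the English snippet
print(len(enc.encode(cyrillic_text)))  # Cyrillic generally costs more tokens per word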
So, I think it's issue #1
Alright, so it's not a big difference actually... It's inconvenient that I cannot see what's been sent to the LLM... I could just debug from that.
The Llama Logger I mentioned earlier will keep track of that! Also, you can turn on debug logs to see it directly in the terminal:
Plain Text
import logging
import sys

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
@Logan M Yes, I turned the debug logs on, but I still cannot see what's been sent to the LLM:

Plain Text
Entering new AgentExecutor chain...

Thought: Do I need to use a tool? Yes

Action: Vector Index 29.06.2022 Правила Приема МУТиГ-2022 рус СОНГЫ.pdf

Action Input: порядок приема> 

[query] Total LLM token usage: 2832 tokens

> [query] Total embedding token usage: 15 tokens

Observation: 
Порядок приема пакета документов поступающего в «Международный университет туризма и гостеприимства» определяется Приемной комиссией. Поступающий должен представить в Приемную комиссию оригиналы документов. После приема документов Пред
(Translation: "The procedure for accepting an applicant's document package at the International University of Tourism and Hospitality is determined by the Admissions Committee. The applicant must submit the original documents to the Admissions Committee. After the documents are accepted, the Ch..."; again cut off mid-word.)
Ah, langchain must be swallowing the logs?

You can test the index outside of langchain for debugging if you want (you can query with the same text as the Action Input above).
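Concretely, a standalone check outside the agent might look like this sketch (vector_index is assumed to be one of the per-document indices; the query text mirrors the Action Input above):

Plain Text
# Sketch: query the vector index directly, bypassing the langchain agent,
# so the debug logs and token counts are easy to inspect
response = vector_index.query("порядок приема", similarity_top_k=1)
print(response)
print(response.source_nodes)  # the chunks that were sent to the LLM as context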
Seems like I need to clean my text as well; after parsing, it looks messy:

Plain Text
 "\nПоложение о порядке предоставления койко-мест в общежитии обучающимся \nв НАО «Международный университет туризма и гостеприимства». Издание 2 \n \n \nСтраница 60 из 71 \n7.АДРЕСА И РЕКВИЗИТЫ СТОРОН \nНекомерческое акционерное общество \n«Международный университет туризма и \nгостеприимства» \nАдрес: город Туркестан, улица Рабига \nСултанбегим, №14 А \n \nБСН 190440033845, КБЕ 18, КНП 872 \nР/С KZ 2960 1000 1000 026 251  \nАО «НАРОДНЫЙ БАНК» \n \nПредседатель Правления - Ректор \n__________   Сакенов А.М.    \n«___»__________2022 год. \nПроживающий: \n \n__________________________________ \nУдостоверение личности \n№_________________ \nвыдано ____ от «__» _______ 20__ года  \nИИН: _____________________________ \nАдрес проживания: _________________ \n______________________________________ \nс Положением о предоставлении мест \nпроживания в Общежитии и Правилами \nвнутреннего распорядка Общежития \nознакомлен  \n________________________________________\n____________________________ \nподпись                                      Ф.И.О.  \n«___»__________2022 год. \n \n \n \n \n"
oh weird! Yea that would help with the answer quality for sure
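A small clean-up pass before indexing, along the lines of this sketch, usually helps with PDF extractions like the one above (the regexes and the Document re-wrapping are illustrative):

Plain Text
import re
from gpt_index import Document

def clean_pdf_text(text: str) -> str:
    """Collapse stray newlines, underscore runs, and repeated whitespace left by PDF parsing."""
    text = re.sub(r"_{2,}", " ", text)            # signature / fill-in-the-blank lines
    text = re.sub(r"[ \t]*\n[ \t]*", " ", text)   # unwrap hard line breaks
    text = re.sub(r"\s{2,}", " ", text)           # collapse runs of whitespace
    return text.strip()

cleaned_docs = [Document(clean_pdf_text(doc.text)) for doc in documents]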