# Imports assumed for the legacy LlamaIndex (ServiceContext-era) API, with strmlt as the Streamlit module alias
import streamlit as strmlt
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.chat_engine import CondenseQuestionChatEngine
from llama_index.composability import QASummaryQueryEngineBuilder
from llama_index.llms import OpenAI

strmlt.session_state.llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo-16k", max_tokens=4096)
strmlt.session_state.embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/multilingual-e5-large",
    model_kwargs={"device": "cuda"},
)
strmlt.session_state.service_context = ServiceContext.from_defaults(
    llm=strmlt.session_state.llm,
    embed_model=strmlt.session_state.embeddings,
    context_window=16384,
    num_output=1024,
)
strmlt.session_state.index = VectorStoreIndex.from_documents(
    documents,
    service_context=strmlt.session_state.service_context,
    show_progress=True,
    similarity_top_k=3,
)
# strmlt.session_state.conversation = strmlt.session_state.index.as_query_engine()
query_engine_builder = QASummaryQueryEngineBuilder(service_context=strmlt.session_state.service_context)
query_engine = query_engine_builder.build_from_documents(documents)
strmlt.session_state.chat_history = []
strmlt.session_state.conversation = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    chat_history=strmlt.session_state.chat_history,
    verbose=True,
)
So with this method I hit a big problem: the input size is way bigger than the model's max context of 16k tokens,
but with the classical chat engine the problem does not appear.
How can I solve this issue?
openai.error.InvalidRequestError: This model's maximum context length is 16385 tokens. However, you requested 17476 tokens (13380 in the messages, 4096 in the completion). Please reduce the length of the messages or completion.
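(As a side note, a quick way to sanity-check where the 13380 prompt tokens come from is to count them with tiktoken; prompt_text below is just a placeholder for whatever stacked prompt gets sent.)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo-16k")
print(len(enc.encode(prompt_text)))  # number of tokens the prompt will consume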
I'm using a batch of documents
Are you getting this error on the first query?
When I try this setting: chunk_size_limit=10000,
it runs without stopping, like it's in an infinite loop.
Ah yes! You need to reduce the chunk size; the default is 1024.
Since you are setting such a high value for the chunk size, GPT is not left with room to generate more text.
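For example, a minimal sketch of setting a smaller chunk size on the same service context (the 512 value is just illustrative; chunk_size is the newer name for chunk_size_limit):
service_context = ServiceContext.from_defaults(
    llm=strmlt.session_state.llm,
    embed_model=strmlt.session_state.embeddings,
    context_window=16384,
    num_output=1024,
    chunk_size=512,  # smaller chunks leave room in the prompt for the completion
)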
So the goal is first to summarize the context (41 documents),
then to adapt and refine the user's question, calling the appropriate query engine.
It works well when there is not as much text,
and in every situation where I don't ask it to query.
Finally, chunk_size_limit=10000 works, but I don't know why.
Basically, during the response synthesis stage, all the top_k similar source nodes are stacked together and sent to the LLM along with your query.
Let's say you have set the chunk size to 2048 and you get 5 top_k source texts from the index.
These five will be stacked one after another, like:
source node 1 + source node 2 + ... + source node 5 + your query
This whole thing is passed to the LLM. Since you have set the max output tokens to 4096, GPT needs that many tokens to be kept free out of the total allowed size for the model, which in your case is 16k.
If you make the chunk size larger, you will also eat into that 4096-token output space.
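As a rough sketch of that budget (numbers purely illustrative, matching the example above):
context_window = 16385    # gpt-3.5-turbo-16k
max_output = 4096         # reserved for the completion
chunk_size = 2048
top_k = 5

prompt_budget = context_window - max_output   # room left for the stacked sources + query
stacked_sources = top_k * chunk_size          # tokens taken by the retrieved chunks alone
print(prompt_budget, stacked_sources)         # 12289 vs 10240, before the query and prompt template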
But why does this also work?
Is it the overall document context that gets chunked to that size?
Or is it just cutting off some of the context?
Yes, when you define your service_context
with chunk_size,
it will chunk the inserted documents to exactly that size.
So that's what "chunk_size_limit" does?
Or is chunk_size_limit = chunk_size?
Yes, both are the same; LlamaIndex is going to deprecate chunk_size_limit.
So is setting this to 10000 a good approach for my summarization purposes?
Or is there a better method?
With 10K are you able to make multiple queries?
I would reduce the chunk size to a small number.
OK, but it's not truncating the text randomly, it's doing it in an intelligent way?
Or is it doing multiple prompts
to fill the base context tokens,
and then summarizing the multiple prompts together?
Yes. It chunks the text based on the size you set, or falls back to the default.
I wanted to use the embedding model to decide the best way to split the documents.
Chunking the text of the documents happens during the index construction stage.
Whereas for response generation, it picks your top_k most similar documents, which are the chunked docs.
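A minimal sketch to inspect which chunked nodes get picked at query time (assuming the index built above; the query string is just an example):
retriever = strmlt.session_state.index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("Summarize the main points of the corpus")  # example query
for node_with_score in nodes:
    print(node_with_score.score, node_with_score.node.get_text()[:100])  # similarity score + start of each chunk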
OK, so only 3 documents are used to make a summary of everything?
That depends entirely on your documents. If a document is large it will be chunked into several document objects; if it is small it can end up as a single one.
Because you told me that it's only looking at the 3 best docs.
So it will be limited, no?
Or is it doing something else?
I think you got confused here lol
Wait let me rephrase
First stage: index construction.
Let's say you have one résumé file,
let's say you have set chunk_size = 1024,
let's say you have set the top_k value to 2,
and let's say your résumé is 3000 tokens in total.
During index construction, that single résumé will be chunked into 3 separate doc objects.
That was the chunking of one document; even if you provide N documents, each of them is chunked in the same fashion.
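A minimal sketch of that chunking step (SimpleNodeParser and the toy resume_text variable are just for illustration):
from llama_index import Document
from llama_index.node_parser import SimpleNodeParser

parser = SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20)
nodes = parser.get_nodes_from_documents([Document(text=resume_text)])
print(len(nodes))  # a ~3000-token document ends up as roughly 3 nodes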
Second stage: during the query.
You query your résumé, asking where Adam did his schooling.
Cosine similarity is applied to find the top two most similar chunked docs out of all the docs, and
they are combined together to form a new query that is passed to the LLM of your choice.
But also, I was thinking it works in a different way for summary purposes,
because my goal is to produce a summary.
The current query engine selects the best query engine to answer with: a summary or a specific question.
What I meant here is that only 3 documents out of all of them would be used to summarize the 41 docs.
Also, chunk_size_limit does not appear to behave the same as the chunk_size option of ServiceContext:
indeed, I always get the error with the chunk_size option, but not with the chunk_size_limit option,
when using it with a local LLaMA model.
This looks like an error with the whisper loader 🤷‍♂️
No, whisper doesn't matter here;
it's just that the Python env lives in a folder with that name, nothing to do with whisper.
OK, but I would like to know why I get the error with one and not the other; where does it come from?
Also, it works with OpenAI.
I have no idea, this conversation is a little long/messy lol
I see you are using turbo-16k with max_tokens set to 4096, but then you set num_output to 1024. Both of these numbers should be the same (I suggest setting both to 1024).
Yep indeed, sorry, and thanks for all of these answers.
But I don't understand why the context length and the output length should be the same?
That would cause a token error
(or the remaining context/output could be 0)
max_tokens isn't context length, it's the number of tokens that the model can generate before getting cut off
num_output is what leaves room for those tokens in the input.
If these two numbers are not the same, then the model might try to generate more tokens than it has room for, leading to token errors
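For example, a minimal sketch of keeping the two values in sync (same legacy ServiceContext API as above; 1024 is just the suggested value):
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo-16k", max_tokens=1024)
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=strmlt.session_state.embeddings,
    context_window=16384,
    num_output=1024,  # matches max_tokens so the prompt budget reserves the right amount of room
)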
Oh OK, you mean in the OpenAI settings, I see.