# Imports assumed for the legacy LlamaIndex (ServiceContext-era) API, with strmlt as the Streamlit module alias
import streamlit as strmlt
from langchain.embeddings import HuggingFaceEmbeddings
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.chat_engine import CondenseQuestionChatEngine
from llama_index.composability import QASummaryQueryEngineBuilder
from llama_index.llms import OpenAI

strmlt.session_state.llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo-16k", max_tokens=4096)
strmlt.session_state.embeddings = HuggingFaceEmbeddings(
    model_name="intfloat/multilingual-e5-large",
    model_kwargs={"device": "cuda"},
)
strmlt.session_state.service_context = ServiceContext.from_defaults(
    llm=strmlt.session_state.llm,
    embed_model=strmlt.session_state.embeddings,
    context_window=16384,
    num_output=1024,
)
strmlt.session_state.index = VectorStoreIndex.from_documents(
    documents,
    service_context=strmlt.session_state.service_context,
    show_progress=True,
    similarity_top_k=3,
)
# strmlt.session_state.conversation = strmlt.session_state.index.as_query_engine()
query_engine_builder = QASummaryQueryEngineBuilder(service_context=strmlt.session_state.service_context)
query_engine = query_engine_builder.build_from_documents(documents)
strmlt.session_state.chat_history = []
strmlt.session_state.conversation = CondenseQuestionChatEngine.from_defaults(
    query_engine=query_engine,
    chat_history=strmlt.session_state.chat_history,
    verbose=True,
)
So with this method I hit a big problem: the input size is way bigger than the model's max context of 16k tokens,
but with the classical chat engine the problem does not appear.
How can I solve this issue?
openai.error.InvalidRequestError: This model's maximum context length is 16385 tokens. However, you requested 17476 tokens (13380 in the messages, 4096 in the completion). Please reduce the length of the messages or completion.
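(As a side note, a quick way to sanity-check where the 13380 prompt tokens come from is to count them with tiktoken; prompt_text below is just a placeholder for whatever stacked prompt gets sent.)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo-16k")
print(len(enc.encode(prompt_text)))  # number of tokens the prompt will consume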
I'm using a batch of documents
Are you getting this error on the first query?
When I try this setting: chunk_size_limit=10000,
it runs without stopping, like it's in an infinite loop.
Ah yes! You need to reduce the chunk size; the default is 1024.
Since you are setting such a high value for the chunk size, GPT is not left with room to generate more text.
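For example, a minimal sketch of setting a smaller chunk size on the same service context (the 512 value is just illustrative; chunk_size is the newer name for chunk_size_limit):
service_context = ServiceContext.from_defaults(
    llm=strmlt.session_state.llm,
    embed_model=strmlt.session_state.embeddings,
    context_window=16384,
    num_output=1024,
    chunk_size=512,  # smaller chunks leave room in the prompt for the completion
)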
So the goal is first to summarize the context (41 documents),
then to adapt and refine the user's question, calling the appropriate query engine.
It works well when there is not as much text,
and in every situation where I don't ask it to query.
Finally, chunk_size_limit=10000 works, but I don't know why.
Basically, during the response synthesis stage, all the top_k similar source nodes are stacked together and sent to the LLM along with your query.
Let's say you have set the chunk size to 2048 and you get 5 top_k source texts from the index.
These five will be stacked one after another, like:
source node 1 + source node 2 + ... + source node 5 + your query
This whole thing is passed to the LLM. Since you have set the max output tokens to 4096, GPT needs that many tokens to be kept free out of the total allowed size for the model, which in your case is 16k.
If you make the chunk size larger, you will also eat into that 4096-token output space.
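As a rough sketch of that budget (numbers purely illustrative, matching the example above):
context_window = 16385    # gpt-3.5-turbo-16k
max_output = 4096         # reserved for the completion
chunk_size = 2048
top_k = 5

prompt_budget = context_window - max_output   # room left for the stacked sources + query
stacked_sources = top_k * chunk_size          # tokens taken by the retrieved chunks alone
print(prompt_budget, stacked_sources)         # 12289 vs 10240, before the query and prompt template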
But why does this also work?
Is it the overall document context that gets chunked to that size?
Or is it just cutting off some of the context?
Yes, when you define your service_context
with chunk_size,
it will chunk the inserted documents to exactly that size.
So that's what "chunk_size_limit" does?
Or is chunk_size_limit = chunk_size?
Yes, both are the same; LlamaIndex is going to deprecate chunk_size_limit.
So is setting this to 10000 a good approach for my summarization purposes?
Or is there a better method?
With 10K are you able to make multiple queries?
I would reduce the chunk size to a small number.
OK, but it's not truncating the text randomly, it's doing it in an intelligent way?
Or is it doing multiple prompts
to fill the base context tokens,
and then summarizing the multiple prompts together?
Yes. It chunks the text based on the size you set, or falls back to the default.
I wanted to use the embedding model to decide the best way to split the documents.
Chunking the text of the documents happens during the index construction stage.
Whereas for response generation, it picks your top_k most similar documents, which are the chunked docs.
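A minimal sketch to inspect which chunked nodes get picked at query time (assuming the index built above; the query string is just an example):
retriever = strmlt.session_state.index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("Summarize the main points of the corpus")  # example query
for node_with_score in nodes:
    print(node_with_score.score, node_with_score.node.get_text()[:100])  # similarity score + start of each chunk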
OK, so only 3 documents are used to make a summary of everything?
That depends entirely on your documents. If a document is large it will be chunked into several document objects; if it is small it can end up as a single one.
Because you told me that it's only looking at the 3 best docs.
So it will be limited, no?
Or is it doing something else?
I think you got confused here lol
Wait let me rephrase
First stage: index construction.
Let's say you have one résumé file,
let's say you have set chunk_size = 1024,
let's say you have set the top_k value to 2,
and let's say your résumé is 3000 tokens in total.
During index construction, that single résumé will be chunked into 3 separate doc objects.
That was the chunking of one document; even if you provide N documents, each of them is chunked in the same fashion.
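A minimal sketch of that chunking step (SimpleNodeParser and the toy resume_text variable are just for illustration):
from llama_index import Document
from llama_index.node_parser import SimpleNodeParser

parser = SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20)
nodes = parser.get_nodes_from_documents([Document(text=resume_text)])
print(len(nodes))  # a ~3000-token document ends up as roughly 3 nodes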
Second stage: during the query.
You query your résumé, asking where Adam did his schooling.
Cosine similarity is applied to find the top two most similar chunked docs out of all the docs, and
they are combined together to form a new query that is passed to the LLM of your choice.
But also, I was thinking it works in a different way for summary purposes,
because my goal is to produce a summary.
The current query engine selects the best query engine to answer with: a summary or a specific question.
What I meant here is that only 3 documents out of all of them would be used to summarize the 41 docs.
Also, chunk_size_limit does not appear to behave the same as the chunk_size option of ServiceContext:
indeed, I always get the error with the chunk_size option, but not with the chunk_size_limit option,
when using it with a local LLaMA model.
This looks like an error with the whisper loader 🤷‍♂️
No, whisper doesn't matter here;
it's just that the Python env lives in a folder with that name, nothing to do with whisper.
OK, but I would like to know why I get the error with one and not the other; where does it come from?
Also, it works with OpenAI.
I have no idea, this conversation is a little long/messy lol
I see you are using turbo-16k with max_tokens set to 4096, but then you set num_output to 1024. Both of these numbers should be the same (I suggest setting both to 1024).
Yep indeed, sorry, and thanks for all of these answers.
But I don't understand why the context length and the output length should be the same?
That would cause a token error
(or the remaining context/output could be 0)
max_tokens isn't context length, it's the number of tokens that the model can generate before getting cut off
num_output is what leaves room for those tokens in the input.
If these two numbers are not the same, then the model might try to generate more tokens than it has room for, leading to token errors
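For example, a minimal sketch of keeping the two values in sync (same legacy ServiceContext API as above; 1024 is just the suggested value):
llm = OpenAI(temperature=0.1, model="gpt-3.5-turbo-16k", max_tokens=1024)
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=strmlt.session_state.embeddings,
    context_window=16384,
    num_output=1024,  # matches max_tokens so the prompt budget reserves the right amount of room
)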
Oh OK, you mean in the OpenAI settings, I see.