
Chunk size

At a glance

The community member has set the chunk size to 1024 in the service context step, but is getting an error that the token indices sequence length is longer than the specified maximum sequence length for the model (1043 > 512). The community member has printed the chunk size limit, which shows it is set to 1024, but it seems the chunk size is being overridden with a default value.

The comments suggest that this "error" is actually just a warning from the transformers library, and the chunk size is usually approximate. The community member is using the HuggingFaceLLMPredictor class with a max input size of 2048, but the actual chunk size during querying is 512.

The community member has tried changing the chunk size value, but the issue persists. They have also shared their service context and prompt helper setup, which uses a HuggingFaceEmbeddings model (all-mpnet-base-v2) with a maximum input size of 512.

The comments suggest that if the chunk size is changed, the community member will need to re-embed the data. There is also a suggestion that the issue may be related to a transformers library update, which could have broken the flow.

I have set my chunk size to 1024 in the service context step, but when I'm querying I'm getting this error:
Token indices sequence length is longer than the specified maximum sequence length for this model (1043 > 512). Running this sequence through the model will result in indexing errors

I printed the chunk size limit as well while starting the server:
Plain Text
print(service_context.chunk_size_limit)

Output: 1024


It looks like the chunk size limit is getting overridden with some default value, but the default value I set for the chunk size is 1024.
15 comments
That "error" isn't actually an error. Python3.8 uses a transformers tokenizer and that's just a warning coming from transformers (since it's a gpt2 tokenizer)
The chunk size is usually approximate tbh. 1043 is close, the token splitting code is a little complex lol
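For reference, the same warning can be reproduced directly with a transformers GPT-2 tokenizer, outside of llama_index entirely; it's logged when an encoded sequence is longer than the tokenizer's model_max_length, and it does not raise an exception (a minimal sketch; the repeated text is just an illustration):
Plain Text
from transformers import GPT2TokenizerFast

# GPT-2's model_max_length is 1024; encoding a longer sequence prints the
# "Token indices sequence length is longer than ..." warning but still returns the ids.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
ids = tokenizer("some text " * 2000)["input_ids"]
print(len(ids))  # warning is logged, execution continues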
I kept this in a try block, and it is failing 😅
oh, maybe this is something else then πŸ˜… Now that I look at it again, I see 512 is the max input size. Are you using a custom LLM anywhere?
I'm using the HuggingFaceLLMPredictor class for the LLM
Plain Text
# torch is imported and query_wrapper_prompt is defined earlier in the script
hf_predictor = HuggingFaceLLMPredictor(
    max_input_size=2048,
    max_new_tokens=256,
    temperature=0.25,
    do_sample=False,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="Writer/camel-5b-hf",
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    model_kwargs={"torch_dtype": torch.bfloat16},
)
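(The query_wrapper_prompt referenced above is defined elsewhere in the user's code; in the camel-5b examples of that era it is typically a SimpleInputPrompt along these lines. This is only a hypothetical illustration, and the import path assumes llama_index 0.6.x:)
Plain Text
from llama_index.prompts.prompts import SimpleInputPrompt

# hypothetical example; the actual wrapper prompt text is defined elsewhere by the user
query_wrapper_prompt = SimpleInputPrompt(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)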
I tried changing the chunk size value from 1024 to 1300 to 1500, but during actual querying it still uses 512 as the chunk size
and what does your service context/prompt helper setup look like again?
Plain Text
from langchain.embeddings import HuggingFaceEmbeddings  # langchain import path as of this 0.6.x-era setup
from llama_index import LangchainEmbedding, ServiceContext

embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))

service_context = ServiceContext.from_defaults(chunk_size_limit=1024, llm_predictor=hf_predictor, embed_model=embed_model)
Ohhhh ok, so mpnet-base-v2 has an input size limit of 512, which is where the warning is coming from.

But putting in longer sequences shouldn't stop program execution, at least it didn't the last time I ran that model πŸ‘€
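A quick way to confirm where the 512 in the warning comes from is to load the embedding model's tokenizer and check its maximum length (a minimal sketch using transformers; the model name is the one from the snippet above):
Plain Text
from transformers import AutoTokenizer

# the tokenizer's model_max_length is the 512 that appears in the warning
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
print(tokenizer.model_max_length)  # 512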
Yeah, I ran it 2 days back and it worked.

I created a new env today and used the same 0.6.15, since the newer version has some changes to the vector store names which I haven't updated in the code yet. But it started to break at this point.
Initially the chunk size was set at 512, but since I started getting this error, I tried changing it to a different set of values to test.
If I change the chunk size, do I need to create the embeddings again?
I wonder if transformers updated and broke this flow 😦

Yeah, if you change the chunk size, you'll need to re-embed again
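For example, re-embedding just means rebuilding the index with the new chunk size (a minimal sketch assuming llama_index 0.6.x, a local "data" directory, and the hf_predictor and embed_model defined above):
Plain Text
from llama_index import GPTVectorStoreIndex, ServiceContext, SimpleDirectoryReader

# rebuild the service context with the new chunk size, then re-index (and re-embed) the documents
service_context = ServiceContext.from_defaults(chunk_size_limit=512, llm_predictor=hf_predictor, embed_model=embed_model)
documents = SimpleDirectoryReader("data").load_data()
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
index.storage_context.persist()  # save the freshly embedded index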