
Chunk size

At a glance

The community member has set the chunk size to 1024 in the service context step, but is getting an error that the token indices sequence length is longer than the specified maximum sequence length for the model (1043 > 512). The community member has printed the chunk size limit, which shows it is set to 1024, but it seems the chunk size is being overridden with a default value.

The comments suggest that this "error" is actually just a warning from the transformers library, and the chunk size is usually approximate. The community member is using the HuggingFaceLLMPredictor class with a max input size of 2048, but the actual chunk size during querying is 512.

The community member has tried changing the chunk size value, but the issue persists. They have also shared their service context and prompt helper setup, which uses a HuggingFaceEmbeddings model (all-mpnet-base-v2) with a maximum input size of 512.

The comments suggest that if the chunk size is changed, the community member will need to re-embed the data. There is also a suggestion that the issue may be related to a transformers library update, which could have broken the flow.

I have set my chunk size to 1024 in the service context step, but when I'm querying I'm getting this error:
Token indices sequence length is longer than the specified maximum sequence length for this model (1043 > 512). Running this sequence through the model will result in indexing errors

I printed the chunk size limit as well while starting the server:
Plain Text
print(service_context.chunk_size_limit)

Output: 1024


It looks like the chunk size limit is getting overridden with some default value, but the default value I set for the chunk size is 1024.
15 comments
That "error" isn't actually an error. Python3.8 uses a transformers tokenizer and that's just a warning coming from transformers (since it's a gpt2 tokenizer)
The chunk size is usually approximate tbh. 1043 is close, the token splitting code is a little complex lol
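For reference, the same warning can be reproduced directly with a transformers GPT-2 tokenizer, outside of llama_index entirely; it's logged when an encoded sequence is longer than the tokenizer's model_max_length, and it does not raise an exception (a minimal sketch; the repeated text is just an illustration):
Plain Text
from transformers import GPT2TokenizerFast

# GPT-2's model_max_length is 1024; encoding a longer sequence prints the
# "Token indices sequence length is longer than ..." warning but still returns the ids.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
ids = tokenizer("some text " * 2000)["input_ids"]
print(len(ids))  # warning is logged, execution continues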
I kept this in a try block, and it is failing 😅
oh, maybe this is something else then πŸ˜… Now that I look at it again, I see 512 is the max input size. Are you using a custom LLM anywhere?
I'm using the HuggingFaceLLMPredictor class for the LLM
Plain Text
# torch is imported and query_wrapper_prompt is defined earlier in the script
hf_predictor = HuggingFaceLLMPredictor(
    max_input_size=2048,
    max_new_tokens=256,
    temperature=0.25,
    do_sample=False,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="Writer/camel-5b-hf",
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    model_kwargs={"torch_dtype": torch.bfloat16},
)
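(The query_wrapper_prompt referenced above is defined elsewhere in the user's code; in the camel-5b examples of that era it is typically a SimpleInputPrompt along these lines. This is only a hypothetical illustration, and the import path assumes llama_index 0.6.x:)
Plain Text
from llama_index.prompts.prompts import SimpleInputPrompt

# hypothetical example; the actual wrapper prompt text is defined elsewhere by the user
query_wrapper_prompt = SimpleInputPrompt(
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query_str}\n\n### Response:"
)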
I tried changing the chunk size value from 1024 to 1300 to 1500, but during actual querying it still uses 512 as the chunk size
and what does your service context/prompt helper setup look like again?
Plain Text
from langchain.embeddings import HuggingFaceEmbeddings  # langchain import path as of this 0.6.x-era setup
from llama_index import LangchainEmbedding, ServiceContext

embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))

service_context = ServiceContext.from_defaults(chunk_size_limit=1024, llm_predictor=hf_predictor, embed_model=embed_model)
Ohhhh ok, so mpnet-base-v2 has an input size limit of 512, which is where the warning is coming from.

But putting in longer sequences shouldn't stop program execution, at least it didn't the last time I ran that model πŸ‘€
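A quick way to confirm where the 512 in the warning comes from is to load the embedding model's tokenizer and check its maximum length (a minimal sketch using transformers; the model name is the one from the snippet above):
Plain Text
from transformers import AutoTokenizer

# the tokenizer's model_max_length is the 512 that appears in the warning
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-mpnet-base-v2")
print(tokenizer.model_max_length)  # 512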
Yeah, I ran it 2 days back and it worked.

I created a new env today and used the same 0.6.15, since the newer version has some changes to the vector store names which I haven't updated in the code yet. But it started to break at this point.
Initially the chunk size was set at 512, but since I started getting this error, I tried changing it to a different set of values to test.
If I change the chunk size, do I need to create the embeddings again?
I wonder if transformers updated and broke this flow 😦

Yeah, if you change the chunk size, you'll need to re-embed again
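For example, re-embedding just means rebuilding the index with the new chunk size (a minimal sketch assuming llama_index 0.6.x, a local "data" directory, and the hf_predictor and embed_model defined above):
Plain Text
from llama_index import GPTVectorStoreIndex, ServiceContext, SimpleDirectoryReader

# rebuild the service context with the new chunk size, then re-index (and re-embed) the documents
service_context = ServiceContext.from_defaults(chunk_size_limit=512, llm_predictor=hf_predictor, embed_model=embed_model)
documents = SimpleDirectoryReader("data").load_data()
index = GPTVectorStoreIndex.from_documents(documents, service_context=service_context)
index.storage_context.persist()  # save the freshly embedded index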