Updated 2 years ago

Chunk size

I have set my chunk size to 1024 in the service context step, but when I query I get this error:
Token indices sequence length is longer than the specified maximum sequence length for this model (1043 > 512). Running this sequence through the model will result in indexing errors

I also printed the chunk size limit when starting the server:
print(service_context.chunk_size_limit)

Output: 1024


It looks like the chunk size limit is getting overridden by some default value, but the default chunk size is itself 1024.
15 comments
That "error" isn't actually an error. Python 3.8 uses a transformers tokenizer, and that's just a warning coming from transformers (since it's a GPT-2 tokenizer).
The chunk size is usually approximate tbh. 1043 is close, the token splitting code is a little complex lol
Kept this in a try section, it is failing 😅
oh, maybe this is something else then 😅 Now that I look at it again, I see 512 is the max input size. Are you using a custom LLM anywhere?
I'm using the HuggingFaceLLMPredictor class for the LLM:
hf_predictor = HuggingFaceLLMPredictor(
    max_input_size=2048,
    max_new_tokens=256,
    temperature=0.25,
    do_sample=False,
    query_wrapper_prompt=query_wrapper_prompt,
    tokenizer_name="Writer/camel-5b-hf",
    model_name="Writer/camel-5b-hf",
    device_map="auto",
    tokenizer_kwargs={"max_length": 2048},
    model_kwargs={"torch_dtype": torch.bfloat16},
)
I tried changing the chunk size from 1024 to 1300 and then to 1500, but during actual querying it still uses 512 as the chunk size.
and what does your service context/prompt helper setup look like again?
embed_model = LangchainEmbedding(HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2"))

service_context = ServiceContext.from_defaults(chunk_size_limit=1024, llm_predictor=hf_predictor, embed_model=embed_model)
Ohhhh ok, so mpnet-base-v2 has an input size limit of 512, which is where the warning is coming from.

But putting in longer sequences shouldn't stop program execution, at least it didn't the last time I ran that model 👀
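To make the two limits concrete, here is a minimal sketch in plain Python. It does not call llama_index or transformers; `check_chunk` is a hypothetical helper, and the numbers are copied from the snippets and the warning in this thread. It only shows that a 1043-token chunk fits the LLM's `max_input_size` but exceeds the 512 limit quoted in the warning, which is why the warning points at the embedding model rather than at the chunk size setting.

```python
# Illustrative only: plain Python, no llama_index or transformers calls.
# All values are taken from the thread; the helper name is hypothetical.
CHUNK_SIZE_LIMIT = 1024    # ServiceContext.from_defaults(chunk_size_limit=1024)
LLM_MAX_INPUT_SIZE = 2048  # HuggingFaceLLMPredictor(max_input_size=2048)
EMBED_MAX_TOKENS = 512     # the maximum quoted in the warning message


def check_chunk(num_tokens: int) -> dict:
    """Report which of the two separate limits a chunk of num_tokens exceeds."""
    return {
        "fits_llm": num_tokens <= LLM_MAX_INPUT_SIZE,
        "fits_embed": num_tokens <= EMBED_MAX_TOKENS,
        "tokens_over_embed_limit": max(0, num_tokens - EMBED_MAX_TOKENS),
    }


# 1043 is the sequence length reported in the warning.
report = check_chunk(1043)
print(report)
```

Raising `chunk_size_limit` to 1300 or 1500 only makes the chunks longer relative to that 512 limit, so the number in the warning would not change.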
Yeah, I ran it two days ago and it worked.

I created a new env today and used the same 0.6.15 version, since the newer version has some changes to vector store names that I haven't updated in my code yet. But it started to break at this point.
Initially the chunk size was set to 512, but since I started getting this error, I tried changing it to different values to test.
If I change the chunk size, do I need to create the embeddings again?
I wonder if transformers updated and broke this flow 😦

Yea, if you change the chunk size, you'll need to re-embed
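A rough sketch of why re-embedding is needed, in plain Python. The `split_into_chunks` function is a hypothetical fixed-size splitter (the real splitting code is more complex, as mentioned above), and the 3000-token document is made up for illustration. The point is only that a different chunk size produces different chunks, so vectors computed for the old chunks no longer correspond to anything in the new index.

```python
def split_into_chunks(tokens, chunk_size):
    """Naive fixed-size splitter, a stand-in for the real (more complex) one."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


# A made-up document of 3000 placeholder tokens.
tokens = [f"tok{i}" for i in range(3000)]

chunks_512 = split_into_chunks(tokens, 512)
chunks_1024 = split_into_chunks(tokens, 1024)

# Chunks that are identical under both settings could reuse their old vectors.
shared = {tuple(c) for c in chunks_512} & {tuple(c) for c in chunks_1024}

print(len(chunks_512), len(chunks_1024), len(shared))
```

In this toy case no chunk survives the change unchanged, so every stored embedding would have to be recomputed.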