Chunk size

At a glance

The community member is trying to use the tokenizer of the BAAI/bge-large-en-v1.5 model with a TokenTextSplitter and is encountering an issue where the average chunk size is not close to the expected value of 512. They have provided their code and the node statistics, which show a minimum of 132 tokens, a maximum of 230 tokens, and an average of 215.989898989899 tokens. The community member is asking what the mistake might be, as the code seems to work as expected when using the default tokenizer (tiktoken).

In the comments, another community member asks how the chunk size is being counted, noting that chunking includes the longest metadata mode (node.get_content(metadata_mode="embed") or "llm"). The original community member clarifies that they are using metadata_mode = "embed" in this particular case.

There is no explicitly marked answer in the provided information.

Hello guys! Has anyone ever tried to pass the tokenizer of a Hugging Face model to a TokenTextSplitter?

I tried to use the tokenizer of the BAAI/bge-large-en-v1.5 model. I set chunk_size = 512, so I assumed that the average chunk size in bge tokens would be somewhere near this value.

But it seems that's not the case -> Node statistics (tokens): min: 132, max: 230, mean: 215.989898989899, median: 216.0

My code:
Python
# Imports assumed for llama-index >= 0.10 and transformers;
# embedding_model, chunk_size, vector_store and documents are defined elsewhere.
from transformers import AutoTokenizer
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embeddings = HuggingFaceEmbedding(model_name=embedding_model, device="cuda")
# Pass the Hugging Face tokenizer's encode method as the token-counting callable
embeddings_tokenizer = AutoTokenizer.from_pretrained(embedding_model).encode

pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_size // 20,
            tokenizer=embeddings_tokenizer,
        ),
        # QuestionsAnsweredExtractor(questions=3, llm=llm),
        embeddings,
    ],
    vector_store=vector_store,
    # docstore=SimpleDocumentStore(),
)

nodes = pipeline.run(documents=documents, show_progress=True, num_workers=1)


What is my mistake? This seems to work as expected when I do not pass any tokenizer (basically when it uses the standard one, tiktoken, I assume).
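Since the question contrasts the BGE tokenizer with the default one (assumed here to be tiktoken's cl100k_base encoding, which is not confirmed in the thread), a minimal sketch of how the two counters could be compared on the same text; the sample string is a placeholder, not something from the thread:

Python
# Compare how many tokens the two tokenizers produce for the same text.
import tiktoken
from transformers import AutoTokenizer

sample = "Some representative paragraph from the ingested documents..."

bge_tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
tiktoken_encoder = tiktoken.get_encoding("cl100k_base")  # assumed default encoding

print("bge tokens:     ", len(bge_tokenizer.encode(sample)))
print("tiktoken tokens:", len(tiktoken_encoder.encode(sample)))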
2 comments
How are you counting the chunk size? Chunking includes the longest metadata mode

node.get_content(metadata_mode="embed") or "llm"
In this particular case I used metadata_mode = "embed".
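A minimal sketch of how per-node token counts like the statistics quoted in the question could be computed with the BGE tokenizer and embed-mode metadata; it assumes the nodes variable returned by the pipeline run above and llama-index >= 0.10 import paths:

Python
# Count tokens per node on the embed-mode content, as discussed in the comment above.
import statistics
from llama_index.core.schema import MetadataMode
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")

token_counts = [
    len(tokenizer.encode(node.get_content(metadata_mode=MetadataMode.EMBED)))
    for node in nodes
]

print(
    f"min: {min(token_counts)}, max: {max(token_counts)}, "
    f"mean: {statistics.mean(token_counts)}, median: {statistics.median(token_counts)}"
)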