
Chunk size

Hello guys! Has anyone ever tried to pass the tokenizer of a Hugging Face model to a TokenTextSplitter?

I tried to use the tokenizer of the BAAI/bge-large-en-v1.5 model. I set chunk_size = 512, so I assumed that the average chunk size, measured in BGE tokens, would be somewhere near that value.

But it seems like that's not the case -> Node statistics (tokens): min: 132, max: 230, mean: 215.989898989899, median: 216.0

My code:
Plain Text
from transformers import AutoTokenizer
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embeddings = HuggingFaceEmbedding(model_name=embedding_model, device="cuda")
# Use the embedding model's own tokenizer (its encode method) for splitting
embeddings_tokenizer = AutoTokenizer.from_pretrained(embedding_model).encode

pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_size // 20,
            tokenizer=embeddings_tokenizer,
        ),
        # QuestionsAnsweredExtractor(questions=3, llm=llm),
        embeddings,
    ],
    vector_store=vector_store,
    # docstore=SimpleDocumentStore(),
)

nodes = pipeline.run(documents=documents, show_progress=True, num_workers=1)


What is my mistake? Everything seems to work as expected when I do not pass any tokenizer (i.e. when it falls back to the default one, tiktoken, I assume).
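
(The exact measurement code wasn't posted; the statistics above could be reproduced roughly like this, as a sketch that re-tokenizes each node's embed-mode content with the same BGE tokenizer:)
Plain Text
from statistics import mean, median
from transformers import AutoTokenizer

bge_tokenize = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5").encode

# Token count of the content each node exposes for embedding
lengths = [len(bge_tokenize(node.get_content(metadata_mode="embed"))) for node in nodes]

print(
    f"Node statistics (tokens): min: {min(lengths)}, max: {max(lengths)}, "
    f"mean: {mean(lengths)}, median: {median(lengths)}"
)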
2 comments
How are you counting the chunk size? Chunking includes the longest metadata mode, i.e. the content returned by

node.get_content(metadata_mode="embed") or metadata_mode="llm"
In this particular case I used metadata_mode = "embed".
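
(To see the effect described in the comment above, you can compare token counts for the same node across metadata modes; a quick sketch, again assuming the BGE tokenizer:)
Plain Text
from transformers import AutoTokenizer

bge_tokenize = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5").encode

for node in nodes[:5]:
    counts = {
        mode: len(bge_tokenize(node.get_content(metadata_mode=mode)))
        for mode in ("none", "embed", "llm")
    }
    # Per the reply above, chunking budgets chunk_size against the longest
    # metadata mode, so the raw text ("none") ends up smaller than chunk_size.
    print(counts)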