Hello guys! Has anyone ever tried passing the tokenizer of a Hugging Face model to a TokenTextSplitter?
I tried using the tokenizer of the BAAI/bge-large-en-v1.5 model and set chunk_size = 512, so I assumed the average chunk size, measured in bge tokens, would be somewhere near that value.
But that doesn't seem to be the case -> Node statistics (tokens): min: 132, max: 230, mean: 215.989898989899, median: 216.0
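For context, those stats come from counting tokens per node with the same bge tokenizer, roughly like this (embeddings_tokenizer and nodes are defined in the code below):

import statistics

# token count per node, using the same encode callable passed to the splitter
token_counts = [len(embeddings_tokenizer(node.get_content())) for node in nodes]
print(
    f"Node statistics (tokens): min: {min(token_counts)}, max: {max(token_counts)}, "
    f"mean: {statistics.mean(token_counts)}, median: {statistics.median(token_counts)}"
)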
My code:
from transformers import AutoTokenizer
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embedding_model = "BAAI/bge-large-en-v1.5"
chunk_size = 512

embeddings = HuggingFaceEmbedding(model_name=embedding_model, device="cuda")
# callable that maps text -> token ids using the bge tokenizer
embeddings_tokenizer = AutoTokenizer.from_pretrained(embedding_model).encode

pipeline = IngestionPipeline(
    transformations=[
        TokenTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_size // 20,  # 5% overlap
            tokenizer=embeddings_tokenizer,
        ),
        # QuestionsAnsweredExtractor(questions=3, llm=llm),
        embeddings,
    ],
    vector_store=vector_store,
    # docstore=SimpleDocumentStore(),
)
nodes = pipeline.run(documents=documents, show_progress=True, num_workers=1)
What is my mistake? This seems to work as expected when I don't pass any tokenizer, i.e., when it falls back to the default one (tiktoken, I assume).
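P.S. One thing I noticed while poking at this, not sure if it matters: the raw .encode callable adds special tokens ([CLS]/[SEP]) by default, so every count includes two extra tokens:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
print(len(tok.encode("hello world")))                            # 4: [CLS] hello world [SEP]
print(len(tok.encode("hello world", add_special_tokens=False)))  # 2: hello world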