How to Trim Text to Fit an Embedding Model Without Knowing the Tokenizer or Input Size
At a glance
The community member asks how to ensure text fits into an embedding model when neither the tokenizer nor the input size of the model can be known a priori. Other community members suggest using tiktoken to count the number of tokens in the text, and note that most embedding models simply truncate input that exceeds their maximum length. For the NVIDIA model specifically, there is a parameter to control the truncation behavior, e.g. NVIDIAEmbedding(model="nvidia/nv-embedqa-e5-v5", truncate="END"). Another suggestion is to summarize larger documents by summarizing overlapping chunks, using a tool like DocumentSummaryIndex.
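A minimal sketch of that truncation control, assuming the llama-index-embeddings-nvidia package is installed; the model name and truncate value come from the example above, and the sample input string is illustrative:

from llama_index.embeddings.nvidia import NVIDIAEmbedding

# truncate="END" asks the NVIDIA endpoint to drop tokens from the end of any
# input that exceeds the model's maximum length, instead of raising an error.
embed_model = NVIDIAEmbedding(model="nvidia/nv-embedqa-e5-v5", truncate="END")
embedding = embed_model.get_text_embedding("some arbitrary string 'text'")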
hey all: how do you ensure text fits into an embedding model? You can't know a priori what tokenizer an embedding model uses - or even its input size! Or can you somehow? If I have some arbitrary string 'text' and I need to trim it shorter so it fits into 'embed_model', what's the approach?
You can use tiktoken to count the number of tokens in the given text. Another thing I'm doing is creating summaries of larger documents, where you summarize chunks with overlaps. Check out DocumentSummaryIndex.
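A minimal sketch of the tiktoken approach: count the tokens, and if the text is over budget, cut the token list and decode it back to a string. Note that tiktoken implements OpenAI's tokenizers, so for other embedding models the count is only an approximation; the 512-token budget and the cl100k_base encoding here are assumptions, so leave some headroom below the model's real limit:

import tiktoken

def trim_to_token_budget(text: str, max_tokens: int = 512,
                         encoding_name: str = "cl100k_base") -> str:
    # Encode the text, truncate the token list at the budget, decode back.
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return enc.decode(tokens[:max_tokens])

trimmed = trim_to_token_budget("some arbitrary string " * 500)

If the text is far over budget, summarizing overlapping chunks (e.g. with LlamaIndex's DocumentSummaryIndex) preserves more of the content than a hard cut at the token limit.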