Stopping

Is there any way I can pass in text as the stop text, or use the tokenizer without using transformers' AutoTokenizer and having it loaded twice?
What's the issue with the tokenizer?

Currently HuggingFace needs to have stopping IDs, rather than text, since the model predicts IDs at the base level (and it should stop on a particular ID)
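
Roughly, the supported way looks like this with transformers' StoppingCriteria — just a minimal sketch, and the ids in it are placeholders:

Python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnIds(StoppingCriteria):
    """Stop as soon as the most recently generated token is a stop id."""

    def __init__(self, stop_ids):
        self.stop_ids = set(stop_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # input_ids holds the whole sequence so far; only the newest token matters
        return input_ids[0, -1].item() in self.stop_ids

# placeholder ids -- you'd look up the real ones for your tokenizer
stopping_criteria = StoppingCriteriaList([StopOnIds([50278, 0])])
# then: model.generate(..., stopping_criteria=stopping_criteria)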

You cooooould have stopping words, but this would be pretty hacky and not easily supported
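
The hacky version would be re-decoding the text every step and string-matching — sketch only, wouldn't really recommend it:

Python
from transformers import StoppingCriteria

class StopOnWords(StoppingCriteria):
    """Hacky: decode the generated text each step and look for a stop string."""

    def __init__(self, tokenizer, stop_words):
        self.tokenizer = tokenizer
        self.stop_words = stop_words

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        text = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
        # note: this also scans the prompt; in practice you'd slice off the
        # prompt tokens first, and a stop word can span multiple tokens --
        # which is exactly why id-based stopping is preferred
        return any(word in text for word in self.stop_words)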
yeah each model has different ids right?
so for each model i would have to get those ids via text and pass them in
Yea pretty much, it's dependent on the tokenizer

Usually for most models you won't need to do this though. For example, camel does not require this, but the examples for stablelm used it so that's why I included the option in the docs.

Usually you would figure out the ids you need ahead of time for the model you have and always use the same ones
yeah but how do i get those ids
Python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer.convert_tokens_to_ids(["hello", "world"])
[7592, 2088]
Like that 💪
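
And then you'd wire them in once — sketch assuming a recent transformers where generate accepts a list for eos_token_id (on older versions you'd use the StoppingCriteria approach above), with gpt2 standing in as the model:

Python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model for the example
model = AutoModelForCausalLM.from_pretrained("gpt2")

# computed once up front -- these never change for a given tokenizer
stop_ids = tokenizer.convert_tokens_to_ids(["."])

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20, eos_token_id=stop_ids)
print(tokenizer.decode(output[0]))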
yeah so i have to load the tokenizer twice
once myself, and then again when the library does it
but only once because they won’t change?
Yea once you have the ids, they will never change
Unless you are dynamically setting stopping ids
(Which sounds a little different, I'd be curious about your use case if they are dynamic haha)
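
(So you can literally just print them once and pin them as constants — using the bert ids from the example above:)

Python
# ids looked up once offline with tokenizer.convert_tokens_to_ids and then
# hardcoded, so nothing needs to load the tokenizer a second time at runtime
STOP_IDS = [7592, 2088]  # "hello" / "world" for bert-base-uncased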
okay thanks!