Stopping

Is there any way I can pass in text as the stop text, or use the tokenizer without using transformers' AutoTokenizer and having it loaded twice?
What's the issue with the tokenizer?

Currently HuggingFace needs to have stopping IDs, rather than text, since the model predicts IDs at the base level (and it should stop on a particular ID)
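
Roughly, the supported way looks like this with transformers' StoppingCriteria — just a minimal sketch, and the ids in it are placeholders:

Python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnIds(StoppingCriteria):
    """Stop as soon as the most recently generated token is a stop id."""

    def __init__(self, stop_ids):
        self.stop_ids = set(stop_ids)

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        # input_ids holds the whole sequence so far; only the newest token matters
        return input_ids[0, -1].item() in self.stop_ids

# placeholder ids -- you'd look up the real ones for your tokenizer
stopping_criteria = StoppingCriteriaList([StopOnIds([50278, 0])])
# then: model.generate(..., stopping_criteria=stopping_criteria)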

You cooooould have stopping words, but this would be pretty hacky and not easily supported
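
The hacky version would be re-decoding the text every step and string-matching — sketch only, wouldn't really recommend it:

Python
from transformers import StoppingCriteria

class StopOnWords(StoppingCriteria):
    """Hacky: decode the generated text each step and look for a stop string."""

    def __init__(self, tokenizer, stop_words):
        self.tokenizer = tokenizer
        self.stop_words = stop_words

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        text = self.tokenizer.decode(input_ids[0], skip_special_tokens=True)
        # note: this also scans the prompt; in practice you'd slice off the
        # prompt tokens first, and a stop word can span multiple tokens --
        # which is exactly why id-based stopping is preferred
        return any(word in text for word in self.stop_words)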
yeah each model has different ids right?
so for each model i would have to get those ids via text and pass them in
Yea pretty much, it's dependent on the tokenizer

Usually for most models you won't need to do this though. For example, camel does not require this, but the examples for stablelm used it so that's why I included the option in the docs.

Usually you would figure out the ids you need ahead of time for the model you have and always use the same ones
yeah but how do i get those ids
Python
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
>>> tokenizer.convert_tokens_to_ids(["hello", "world"])
[7592, 2088]
Like that 💪
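
And then you'd wire them in once — sketch assuming a recent transformers where generate accepts a list for eos_token_id (on older versions you'd use the StoppingCriteria approach above), with gpt2 standing in as the model:

Python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model for the example
model = AutoModelForCausalLM.from_pretrained("gpt2")

# computed once up front -- these never change for a given tokenizer
stop_ids = tokenizer.convert_tokens_to_ids(["."])

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20, eos_token_id=stop_ids)
print(tokenizer.decode(output[0]))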
yeah so i have to load the tokenizer twice
once myself, and then again when the library does it
but only once because they won’t change?
Yea once you have the ids, they will never change
Unless you are dynamically setting stopping ids
(Which sounds a little different, I'd be curious about your use case if they are dynamic haha)
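
(So you can literally just print them once and pin them as constants — using the bert ids from the example above:)

Python
# ids looked up once offline with tokenizer.convert_tokens_to_ids and then
# hardcoded, so nothing needs to load the tokenizer a second time at runtime
STOP_IDS = [7592, 2088]  # "hello" / "world" for bert-base-uncased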
okay thanks!