Space after unnormalized token is added when `use_fast=True` for Llama tokenizer #1613

Butanium · 2024-08-14T11:48:06Z

Related to: huggingface/transformers#25073

In my current project, I'd like to add a special token that doesn't insert a space to the next token.
Currently, I need to specify use_fast=False in order for this to work. However:

This is unclear to me why I should expect slow and fast tokenizer to behave in different ways
This finding doesn't generalize to e.g. gemma tokenizer which never add such space
Maybe this should be an explicit option in the tokenizer_kwargs?

from transformers import AutoTokenizer
fast_tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf", use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained("meta-llama/llama-2-7b-hf", use_fast=False)
tok = fast_tokenizer.bos_token
s = f'a:{tok}->'
print(f"fast: {fast_tokenizer.tokenize(s)}\nslow: {slow_tokenizer.tokenize(s)}")
>>> fast: ['▁a', ':', '<s>', '▁->']
>>> slow: ['▁a', ':', '<s>', '->']

The text was updated successfully, but these errors were encountered:

Butanium · 2024-08-14T11:49:46Z

Oh wait @ArthurZucker is that what you're fixing here in #1568 ?

Butanium · 2024-08-14T12:14:41Z

Same issue with unnormalized non-special tokens:

from tokenizers import AddedToken
from transformers import AutoTokenizer
tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)
tok = "<special>"
t = AddedToken(
    tok, normalized=False, special=False
 )
fast_tokenizer.add_tokens([t])
slow_tokenizer.add_tokens([t])
s = f'hello:{tok}->'
print(f"fast: {fast_tokenizer.tokenize(s)}\nslow: {slow_tokenizer.tokenize(s)}")
>>> fast: ['▁hello', ':', '<special>', '▁->']
>>> slow: ['▁hello', ':', '<special>', '->']

Butanium · 2024-08-14T12:19:28Z

And there is even more differences when you add normalized=True for special tokens ...

from tokenizers import AddedToken
from transformers import AutoTokenizer
tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)
tok = "<special>"
t = AddedToken(
    tok, normalized=True, special=True
 )
fast_tokenizer.add_tokens([t], special_tokens=True)
slow_tokenizer.add_tokens([t], special_tokens=True)
s = f'hello:{tok}->'
print(f"fast: {fast_tokenizer.tokenize(s)}\nslow: {slow_tokenizer.tokenize(s)}")
>>> fast: ['▁hello', ':', '<', 'special', '>', '->']
>>> slow: ['▁hello', ':', '<special>', '->']

Butanium · 2024-08-14T13:02:35Z

Also, if you specify the add_prefix_space arg, the tokenizer is actually using the slow implementation which leads to different behavior for the above code! https://github.com/huggingface/transformers/blob/9485289f374d4df7e8aa0ca917dc131dcf64ebaf/src/transformers/models/llama/tokenization_llama_fast.py#L154

ArthurZucker · 2024-08-15T17:25:51Z

No this was fixed a LONG time ago!

from tokenizers import AddedToken
from transformers import AutoTokenizer
tok_name = "meta-llama/llama-2-7b-hf"
fast_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=True, legacy=False, from_slow=True)
slow_tokenizer = AutoTokenizer.from_pretrained(tok_name, use_fast=False)
tok = "<special>"
t = AddedToken(
    tok, normalized=True, special=True
 )
fast_tokenizer.add_tokens([t], special_tokens=True)
slow_tokenizer.add_tokens([t], special_tokens=True)
s = f'hello:{tok}->'
print(f"fast: {fast_tokenizer.tokenize(s)}\nslow: {slow_tokenizer.tokenize(s)}")
>>> fast: ['▁hello', ':', '<', 'special', '>', '->']
>>> slow: ['▁hello', ':', '<special>', '->']

ArthurZucker · 2024-08-15T17:26:13Z

See #1357

Butanium · 2024-08-26T10:44:56Z

Hey @ArthurZucker, thanks for your answer. I'm using 0.19.1 which should have the fix.
I'm really confused right now. Why isn't the fact that use_fast alters the behavior of the tokenizer an issue?
My more practical question is, is there a way to add a token s.t.:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model, ...) # magic kwargs
# magically add <special>
s = f'a:<special>->'
print(tokenizer.tokenize(s)})

will always prints [ {whatever}, '<special>', '->'] where the key point here is that there is a -> and not a _-> token

ArthurZucker · 2024-08-26T13:06:39Z

Yes, what effects this is the legacy flag, as Llama was added before we fixed the issue.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model, legacy=False) # magic kwargs
# magically add <special>
s = f'a:<special>->'
print(tokenizer.tokenize(s)})

When you set legacy to False you might not always get the conversion from slow, which forces the legacy attribute to be actually taken into account!

Butanium · 2024-08-29T23:16:36Z

Ok so I should do some unit test and choose different kwarg depending on the tokenizer to get the same behavior?

ArthurZucker · 2024-08-30T15:23:39Z

No, sorry. Basically you can just check the tokenizer's pre_tokenizer. If it's metaspace, the prepend_scheme should be set to first instead of always

Butanium changed the title ~~Space after special token is added when use_fast=True for Llama tokenizer~~ Space after unnormalized token is added when use_fast=True for Llama tokenizer Aug 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Space after unnormalized token is added when `use_fast=True` for Llama tokenizer #1613

Space after unnormalized token is added when `use_fast=True` for Llama tokenizer #1613

Butanium commented Aug 14, 2024 •

edited

Loading

Butanium commented Aug 14, 2024 •

edited

Loading

Butanium commented Aug 14, 2024 •

edited

Loading

Butanium commented Aug 14, 2024 •

edited

Loading

Butanium commented Aug 14, 2024

ArthurZucker commented Aug 15, 2024

ArthurZucker commented Aug 15, 2024

Butanium commented Aug 26, 2024

ArthurZucker commented Aug 26, 2024

Butanium commented Aug 29, 2024

ArthurZucker commented Aug 30, 2024

Space after unnormalized token is added when use_fast=True for Llama tokenizer #1613

Space after unnormalized token is added when use_fast=True for Llama tokenizer #1613

Comments

Butanium commented Aug 14, 2024 • edited Loading

Butanium commented Aug 14, 2024 • edited Loading

Butanium commented Aug 14, 2024 • edited Loading

Butanium commented Aug 14, 2024 • edited Loading

Butanium commented Aug 14, 2024

ArthurZucker commented Aug 15, 2024

ArthurZucker commented Aug 15, 2024

Butanium commented Aug 26, 2024

ArthurZucker commented Aug 26, 2024

Butanium commented Aug 29, 2024

ArthurZucker commented Aug 30, 2024

Space after unnormalized token is added when `use_fast=True` for Llama tokenizer #1613

Space after unnormalized token is added when `use_fast=True` for Llama tokenizer #1613

Butanium commented Aug 14, 2024 •

edited

Loading

Butanium commented Aug 14, 2024 •

edited

Loading

Butanium commented Aug 14, 2024 •

edited

Loading

Butanium commented Aug 14, 2024 •

edited

Loading