
Mismatch between slow and fast tokenizer #1682

Open
KaiLv69 opened this issue Nov 15, 2024 · 1 comment

Comments


KaiLv69 commented Nov 15, 2024

Hi, I trained a SentencePiece tokenizer with prefix match. After converting it to an HF fast tokenizer, the tokenization results are not consistent with the slow tokenizer.

In SentencePiece, we can choose whether to use prefix matching to split the input into token sequences (https://github.com/google/sentencepiece/blob/d8f741853847553169444afc12c00f4bbff3e9ce/src/bpe_model.cc#L111). I can't find a similar function in tokenizers.
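As a reference point, a minimal way to observe the discrepancy is to load the same checkpoint with the slow (SentencePiece-backed) and fast (tokenizers-backed) implementations and compare their output. This is only a sketch; the checkpoint path and sample text are placeholders for the custom tokenizer described above.

```python
from transformers import AutoTokenizer

# "path/to/my-tokenizer" is a placeholder for a directory containing the
# trained SentencePiece model plus the converted fast tokenizer files.
slow = AutoTokenizer.from_pretrained("path/to/my-tokenizer", use_fast=False)
fast = AutoTokenizer.from_pretrained("path/to/my-tokenizer", use_fast=True)

# Placeholder input; any string where prefix matching changes the split works.
text = "example input where prefix matching changes the segmentation"

slow_tokens = slow.tokenize(text)
fast_tokens = fast.tokenize(text)

print("slow:", slow_tokens)
print("fast:", fast_tokens)
print("match:", slow_tokens == fast_tokens)  # False when the two disagree
```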

Is there any plan to support prefix matching for better alignment with SentencePiece?

@ArthurZucker
Collaborator

cc @Narsil as I have no direct idea, would need to look into it!
