Hi, I trained a sentencepiece tokenizer with prefix match enabled. After converting it to a HF fast tokenizer, the tokenization results are not consistent with the slow tokenizer.
In sentencepiece, we can choose whether to use prefix match to split the input into a token sequence before applying BPE (https://github.com/google/sentencepiece/blob/d8f741853847553169444afc12c00f4bbff3e9ce/src/bpe_model.cc#L111). I can't find a similar option in tokenizers.
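For context, the splitting behavior can be sketched as a greedy longest-prefix scan over the vocabulary. This is a toy illustration with a made-up vocabulary, not sentencepiece's actual trie-based implementation:

```python
def prefix_match_split(text, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest possible match starting at position i first.
        for end in range(len(text), i, -1):
            if text[i:end] in vocab:
                pieces.append(text[i:end])
                i = end
                break
        else:
            # Fall back to a single character when nothing matches.
            pieces.append(text[i])
            i += 1
    return pieces

vocab = {"low", "lower", "new", "est"}
print(prefix_match_split("lowernewest", vocab))  # ['lower', 'new', 'est']
```

Note that a greedy match picks "lower" over "low", which is exactly the kind of split that diverges from what a merge-only BPE pass would produce.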
Is there any plan to support prefix match for better alignment with sentencepiece?