Example. Consider a BPE tokenizer with merges M = {yu, yum, my} and initial alphabet A = {y, u, m}. Given the string s = yummy, the standard BPE merge-based strategy tokenizes s as yu | m | my, while BPE with the longest prefix encoding strategy tokenizes s as yum | my.
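The contrast above can be sketched in a few lines of plain Python (this is not the `tokenizers` API, just a toy illustration). One assumption is needed to reproduce the example: the vocabulary entry yum must be unreachable by merges, e.g. because it was formed from the pair (y, um) and um is never produced, so merge-based BPE can never emit it while longest-prefix matching can.

```python
def bpe_merge_encode(text, ranks):
    """Standard BPE: repeatedly apply the lowest-ranked applicable merge."""
    pieces = list(text)
    while True:
        # Find the adjacent pair with the best (lowest) merge rank.
        best = None
        for i in range(len(pieces) - 1):
            pair = (pieces[i], pieces[i + 1])
            if pair in ranks and (best is None or ranks[pair] < ranks[best[1]]):
                best = (i, pair)
        if best is None:
            return pieces
        i, pair = best
        pieces[i : i + 2] = [pair[0] + pair[1]]

def longest_prefix_encode(text, vocab):
    """WordPiece-style greedy longest-prefix matching over the vocabulary."""
    pieces, start = [], 0
    while start < len(text):
        end = len(text)
        while end > start and text[start:end] not in vocab:
            end -= 1
        if end == start:  # not even a single character matched
            raise ValueError(f"cannot encode {text[start]!r}")
        pieces.append(text[start:end])
        start = end
    return pieces

# Toy setup from the example; (y, um) is assumed and can never fire.
ranks = {("y", "u"): 0, ("y", "um"): 1, ("m", "y"): 2}
vocab = {"y", "u", "m", "yu", "yum", "my"}

print(bpe_merge_encode("yummy", ranks))       # ['yu', 'm', 'my']
print(longest_prefix_encode("yummy", vocab))  # ['yum', 'my']
```

Both functions share the same vocabulary; only the inference strategy differs, which is exactly the knob the feature request asks to expose.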
Hi @ArthurZucker,
Thanks a lot for your swift reply!
I think it would be super useful, especially for research purposes. Perhaps the simplest approach would be to allow BPE tokenizers to behave like WordPiece at inference time. In the same way users can assign components such as pre_tokenizers to a tokenizer class, they could in principle pass an inference strategy (e.g., a predictor) too. What do you think?
Hi there,
Do you plan to add the possibility to control how tokenizers behave at inference time?
For example, adding the possibility for the user to decide whether to use standard BPE (merges) or, e.g., the longest prefix encoding strategy. See "Greed is All You Need: An Evaluation of Tokenizer Inference Methods" for why this can be useful.
Thanks in advance for your time!
Best,
Pietro