BPE trainer ignoring special tokens. #1616
Hey! You are adding the special tokens before initializing the normalizer; this worked for me:

```python
from tokenizers import Tokenizer, Regex
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Sequence as PretokenizerSequence, Split
from tokenizers.normalizers import Sequence as NormalizerSequence, Replace, BertNormalizer, Strip

corpus_file = "corpus.txt"

special_tokens = [
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
]
for i in range(20):
    special_tokens.append(f"<disasm_function_{i}>")
    special_tokens.append(f"<disasm_string_{i}>")

tokenizer = Tokenizer(BPE())
# tokenizer.add_special_tokens(special_tokens)  # removed: don't add them before the normalizer
tokenizer.normalizer = NormalizerSequence([
    Strip(),
    BertNormalizer(clean_text=True, strip_accents=True, lowercase=True),
    Replace(Regex(r"\s{2,}"), " "),
    Replace(" ", "<space>"),
])
tokenizer.pre_tokenizer = PretokenizerSequence([
    Split("\n", behavior="removed"),
])
tokenizer.add_special_tokens(special_tokens)  # added: register them after the normalizer/pre-tokenizer
trainer = BpeTrainer(
    special_tokens=special_tokens, vocab_size=10000, min_frequency=2,
)
tokenizer.train(files=[corpus_file], trainer=trainer)
tokenizer.save("example_tokenizer.json")
```
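As a quick sanity check after training, something along these lines (a minimal sketch; the example input string is made up) should show each special token coming out as a single piece at encode time:

```python
from tokenizers import Tokenizer

# Reload the trained tokenizer and encode a line that contains a special token.
tok = Tokenizer.from_file("example_tokenizer.json")
enc = tok.encode("<disasm_function_1> mov eax, ebx")
print(enc.tokens)  # "<disasm_function_1>" should appear as exactly one token
```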
So I tried this and for me it still gives exactly the same result. It works at test time (as did the previous version), but during training it is still merging across the special tokens.
You are right, sorry. Here is a PR with a fix; not sure why we never had that.
I am trying to train a custom tokenizer. My use case is related to assembly code, so I want merges to be possible across full instructions (potentially multiple "words"). To do this, I am replacing all spaces with a dummy token (e.g. "<space>") and have a pre-tokenizer that splits on "\n". This basically works, but my issue comes when I try to add in special tokens. The following is a simple example to reproduce the issue:

An example segment of the corpus I am using to train will look something like:
so the aim is to ensure that e.g. <disasm_function_1> is always a single token. This works at test time (i.e. these special tokens are always tokenized as single tokens), but it's clearly not happening during the BPE training. If I examine the tokens/merges I am getting out, many of them contain the special tokens within them. E.g. from the resulting JSON file:
you can see these learned tokens contain the special tokens within them.
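A rough sketch of how such entries can be spotted programmatically in the saved file (assuming the example_tokenizer.json name from the snippet above and the standard tokenizer.json layout with added_tokens and model.vocab sections):

```python
import json

# Load the serialized tokenizer and list learned vocab entries that contain a special token.
with open("example_tokenizer.json", encoding="utf-8") as f:
    data = json.load(f)

specials = [t["content"] for t in data.get("added_tokens", [])]
leaky = [
    tok for tok in data["model"]["vocab"]
    if any(s in tok and tok != s for s in specials)
]
print(f"{len(leaky)} learned tokens contain a special token, e.g. {leaky[:5]}")
```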
Is this expected behaviour? My assumption was that the BPE trainer would prevent this from happening, as I provide it with a list of the special tokens (why else would it need this argument?). And it's not very desirable to fill up the vocab with lots of merges that are never going to be valid.
Is there any way to stop this from happening (or is there anything that I haven't set up properly)?
EDIT:
My current horrible workaround is to do:
which seems to work, but can't be the best way.