Issues: huggingface/tokenizers
Issues list
out of memory when training a BBPE tokenizer on a large corpus (#1681, opened Nov 14, 2024 by yucc-leon)
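For issues like the out-of-memory report above, the usual mitigation is to stream the corpus instead of loading it whole. A minimal sketch (not taken from the issue itself) of byte-level BPE training from a generator, assuming the `tokenizers` Python API; the corpus contents here are placeholders:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

def corpus():
    # Yield lines lazily instead of materializing the whole corpus in memory.
    # In practice this would read from files on disk, one line at a time.
    for line in ["hello world", "byte level bpe", "streaming training"]:
        yield line

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(vocab_size=300, show_progress=False)
# train_from_iterator consumes the generator lazily.
tokenizer.train_from_iterator(corpus(), trainer=trainer)
```

Streaming avoids holding the raw text in memory, though the trainer's internal word counts still grow with corpus diversity, which may be the actual bottleneck reported.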
Option to disable cache for FromPretrained and FromFile [Feature Request] (#1680, opened Nov 12, 2024 by daulet)
Allow users to select/write encoding strategies [Feature Request] (#1655, opened Oct 16, 2024 by pietrolesci)
Inconsistent behaviour of PreTrainedTokenizerFast on diacritics-marked texts [bug: Something isn't working] (#1663, opened Oct 11, 2024 by sven-nm; 2 of 4 tasks)
Disable pretty-print when saving tokenizer.json files [Feature Request] (#1656, opened Oct 7, 2024 by xenova)
How to build a custom tokenizer on top of an existing Llama 3.2 tokenizer? [training] (#1644, opened Oct 5, 2024 by yakhyo)
NormalizedString.clear() broken? [bug: Something isn't working] (#1636, opened Sep 25, 2024 by lkurlandski)
Adding many AddedTokens makes loading a tokenizer extremely slow (#1635, opened Sep 25, 2024 by stephantul)
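The AddedTokens slowdown above typically shows up when tokens are registered in bulk. A minimal sketch (an illustration, not the reporter's code) of batch-adding tokens in a single call, assuming the `tokenizers` Python API; the token names are placeholders:

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())

# Register many user-defined tokens in one add_tokens call rather than
# one call per token; add_tokens returns how many were actually added.
new_tokens = [AddedToken(f"<tok_{i}>", normalized=False) for i in range(100)]
num_added = tokenizer.add_tokens(new_tokens)
```

Even with batching, every added token participates in the matcher the library builds over the added vocabulary, which is where loading cost can grow with thousands of entries.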
Rust: How to handle models with precompiled_charsmap = null [Feature Request] (#1627, opened Sep 4, 2024 by kallebysantos)
Special token gets tokenized while training tokenizer from scratch (#1624, opened Sep 2, 2024 by LalchandPandia)
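For the special-token issue above, the expected behaviour is that tokens declared as special are matched before pre-tokenization and never split. A minimal sketch (under the assumption that this is the scenario the issue describes) using the `tokenizers` Python API, with placeholder training text:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Declaring special tokens on the trainer registers them as added special
# tokens, so they are extracted whole before the pre-tokenizer runs.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]"], show_progress=False)
tokenizer.train_from_iterator(["hello world"] * 10, trainer=trainer)

encoding = tokenizer.encode("[CLS] hello")
# "[CLS]" should appear as a single token, not split into "[", "CLS", "]".
```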
ModuleNotFoundError: No module named 'tokenizers.tokenizers' (#1619, opened Aug 25, 2024 by jpferraro1)
Space after unnormalized token is added when use_fast=True for Llama tokenizer (#1613, opened Aug 14, 2024 by Butanium)