Issues: huggingface/tokenizers
Issues list
out of memory when training a BBPE tokenizer on a large corpus (#1681, opened Nov 14, 2024 by yucc-leon)
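For issues like the out-of-memory report above, the usual mitigation is to stream the corpus instead of loading it whole. A minimal sketch (not taken from the issue itself) of byte-level BPE training from a generator, assuming the `tokenizers` Python API; the corpus contents here are placeholders:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

def corpus():
    # Yield lines lazily instead of materializing the whole corpus in memory.
    # In practice this would read from files on disk, one line at a time.
    for line in ["hello world", "byte level bpe", "streaming training"]:
        yield line

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()

trainer = BpeTrainer(vocab_size=300, show_progress=False)
# train_from_iterator consumes the generator lazily.
tokenizer.train_from_iterator(corpus(), trainer=trainer)
```

Streaming avoids holding the raw text in memory, though the trainer's internal word counts still grow with corpus diversity, which may be the actual bottleneck reported.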
Option to disable cache for FromPretrained and FromFile [Feature Request] (#1680, opened Nov 12, 2024 by daulet)
Allow users to select/write encoding strategies [Feature Request] (#1655, opened Oct 16, 2024 by pietrolesci)
Inconsistent behaviour of PreTrainedTokenizerFast on diacritics-marked texts [bug: Something isn't working] (#1663, opened Oct 11, 2024 by sven-nm; 2 of 4 tasks)
Disable pretty-print when saving tokenizer.json files [Feature Request] (#1656, opened Oct 7, 2024 by xenova)
How to build a custom tokenizer on top of an existing Llama 3.2 tokenizer? [training] (#1644, opened Oct 5, 2024 by yakhyo)
NormalizedString.clear() broken? [bug: Something isn't working] (#1636, opened Sep 25, 2024 by lkurlandski)
Adding many AddedTokens makes loading a tokenizer extremely slow (#1635, opened Sep 25, 2024 by stephantul)
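The AddedTokens slowdown above typically shows up when tokens are registered in bulk. A minimal sketch (an illustration, not the reporter's code) of batch-adding tokens in a single call, assuming the `tokenizers` Python API; the token names are placeholders:

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE

tokenizer = Tokenizer(BPE())

# Register many user-defined tokens in one add_tokens call rather than
# one call per token; add_tokens returns how many were actually added.
new_tokens = [AddedToken(f"<tok_{i}>", normalized=False) for i in range(100)]
num_added = tokenizer.add_tokens(new_tokens)
```

Even with batching, every added token participates in the matcher the library builds over the added vocabulary, which is where loading cost can grow with thousands of entries.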
Rust: How to handle models with precompiled_charsmap = null [Feature Request] (#1627, opened Sep 4, 2024 by kallebysantos)
Special token gets tokenized while training tokenizer from scratch (#1624, opened Sep 2, 2024 by LalchandPandia)
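For the special-token issue above, the expected behaviour is that tokens declared as special are matched before pre-tokenization and never split. A minimal sketch (under the assumption that this is the scenario the issue describes) using the `tokenizers` Python API, with placeholder training text:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Declaring special tokens on the trainer registers them as added special
# tokens, so they are extracted whole before the pre-tokenizer runs.
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]"], show_progress=False)
tokenizer.train_from_iterator(["hello world"] * 10, trainer=trainer)

encoding = tokenizer.encode("[CLS] hello")
# "[CLS]" should appear as a single token, not split into "[", "CLS", "]".
```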
ModuleNotFoundError: No module named 'tokenizers.tokenizers' (#1619, opened Aug 25, 2024 by jpferraro1)
Space after unnormalized token is added when use_fast=True for Llama tokenizer (#1613, opened Aug 14, 2024 by Butanium)