Disable pretty-print when saving tokenizer.json files #1656

xenova · 2024-10-07T12:30:55Z

Feature request

As the vocabulary of newer models, like Llama 3 or Gemma, increases in size, so does the size of the tokenizer, which includes the vocabulary as JSON (and merges for BPE tokenizers). Pretty-printing these files for serialization introduces a significant overhead as whitespace around the vocabulary and/or merges is added to the file.

This issue is even worse after the new BPE serialization update, which replaces merges like "s1 s2" with ["s1", "s2"]; when pretty-printed, each merge pair is now spread across separate lines.

From quick testing, not pretty-printing the tokenizer.json reduces the file size from 17MB to 7MB.
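As a rough illustration of where the overhead comes from, here is a minimal sketch using only Python's standard `json` module. The vocabulary and merges are toy data invented for this example, not a real tokenizer, so the exact ratio will differ from the 17MB vs 7MB figure above.

```python
import json

# Toy stand-in for the "model" section of a tokenizer.json (hypothetical data).
vocab = {f"tok{i}": i for i in range(1000)}
merges = [[f"tok{i}", f"tok{i + 1}"] for i in range(999)]
doc = {"model": {"type": "BPE", "vocab": vocab, "merges": merges}}

# Indented output puts every vocab entry and every merge element on its own line.
pretty = json.dumps(doc, indent=2)
# Compact output drops all whitespace between tokens.
compact = json.dumps(doc, separators=(",", ":"))

pretty_size = len(pretty.encode("utf-8"))
compact_size = len(compact.encode("utf-8"))
print(f"pretty: {pretty_size} bytes, compact: {compact_size} bytes")
```

Both strings parse back to the same object, so the difference is purely whitespace.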

Understandably, pretty-printing the file can help with debugging, but it's probably better for the default to be unformatted, with a flag to enable formatting for those cases.
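To make the proposal concrete, a sketch of "compact by default, formatted on request" could look like the following. The `save_json` helper and its `pretty` parameter are hypothetical names for this example only; this is not the tokenizers API.

```python
import json
import os
import tempfile


def save_json(data, path, pretty=False):
    """Hypothetical helper: compact by default, indented only on request."""
    with open(path, "w", encoding="utf-8") as f:
        if pretty:
            # Human-readable output for debugging.
            json.dump(data, f, indent=2, ensure_ascii=False)
        else:
            # No whitespace between tokens, for smaller files.
            json.dump(data, f, separators=(",", ":"), ensure_ascii=False)


data = {"model": {"merges": [["s1", "s2"], ["s3", "s4"]]}}

with tempfile.TemporaryDirectory() as tmp:
    compact_path = os.path.join(tmp, "compact.json")
    pretty_path = os.path.join(tmp, "pretty.json")
    save_json(data, compact_path)              # default: unformatted
    save_json(data, pretty_path, pretty=True)  # opt-in formatting
    compact_bytes = os.path.getsize(compact_path)
    pretty_bytes = os.path.getsize(pretty_path)
    with open(compact_path, encoding="utf-8") as f:
        reloaded = json.load(f)
```

The compact file loads back to the same data, so the flag only affects readability and size.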

cc @ArthurZucker
(PS: I can move this to huggingface/tokenizers if it is more applicable there.)

Motivation

To reduce the file sizes (and bandwidth) of downloading, serializing, and uploading these files. In particular, this will greatly benefit Transformers.js users, where bandwidth is important.

Your contribution


ArthurZucker · 2024-10-10T10:04:39Z

We already have a pretty argument in tokenizers, but we should give a bit more granularity.

xenova added the Feature Request label Oct 7, 2024

ArthurZucker transferred this issue from huggingface/transformers Oct 17, 2024
