Feature request
As the vocabularies of newer models like Llama 3 and Gemma grow, so does the size of the tokenizer file, which stores the vocabulary as JSON (plus the merges for BPE tokenizers). Pretty-printing these files on serialization adds significant overhead, because whitespace is inserted around every vocabulary entry and/or merge.
This issue is even worse after the new BPE serialization update, which replaces merges like "s1 s2" with ["s1", "s2"]. When pretty-printed, each such pair is now spread across separate lines.
From quick testing, not pretty-printing the tokenizer.json reduces the file size from 17MB to 7MB.
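To make the overhead concrete, here is a minimal sketch using only Python's standard `json` module. The toy `vocab` and `merges` below are made-up stand-ins, not data from a real tokenizer.json, and the two `json.dumps` calls only approximate how a tokenizer file is actually written, so the ratio it prints will not match the 17MB to 7MB figure above.

```python
import json

# Toy stand-in for a BPE tokenizer.json (hypothetical data; real files are far larger).
vocab = {f"tok{i}": i for i in range(1000)}
# Merges in the newer list-of-pairs form, e.g. ["s1", "s2"].
merges = [[f"tok{i}", f"tok{i + 1}"] for i in range(999)]
tokenizer = {"model": {"type": "BPE", "vocab": vocab, "merges": merges}}

# Pretty-printed: every vocab entry and every element of every merge pair gets its own line.
pretty = json.dumps(tokenizer, indent=2, ensure_ascii=False)
# Compact: no whitespace after separators and no newlines.
compact = json.dumps(tokenizer, separators=(",", ":"), ensure_ascii=False)

# Both strings decode to the same object; only the whitespace differs.
assert json.loads(pretty) == json.loads(compact)

pretty_bytes = len(pretty.encode("utf-8"))
compact_bytes = len(compact.encode("utf-8"))
print(f"pretty:  {pretty_bytes} bytes")
print(f"compact: {compact_bytes} bytes")
```

With `indent=2`, each two-element merge pair takes four lines (opening bracket, two strings, closing bracket), which is why the list-of-pairs format is hit hardest by pretty-printing.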
Understandably, pretty-printing the file can help with debugging, but it would probably be better for the default output to be unformatted, with a flag available for writing formatted output when it is needed.
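One possible shape for such a flag, sketched with the standard library only. The helper name `dump_tokenizer_json` and its `pretty` parameter are hypothetical and exist purely for illustration; they are not an existing Transformers or tokenizers API.

```python
import json


def dump_tokenizer_json(data: dict, pretty: bool = False) -> str:
    """Hypothetical helper: compact output by default, indented only on request."""
    if pretty:
        # Opt-in formatting for debugging or manual inspection.
        return json.dumps(data, indent=2, ensure_ascii=False)
    # Default: no whitespace between tokens, which minimizes file size.
    return json.dumps(data, separators=(",", ":"), ensure_ascii=False)


sample = {"model": {"merges": [["s1", "s2"]]}}

print(dump_tokenizer_json(sample))  # {"model":{"merges":[["s1","s2"]]}}
print(dump_tokenizer_json(sample, pretty=True))  # same data, indented over several lines
```

Defaulting to compact output keeps downloads and uploads small, while the opt-in flag preserves the debugging use case.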
cc @ArthurZucker
(PS: I can move this to huggingface/tokenizers if it is more applicable there.)
Motivation
To reduce the file sizes (and bandwidth) of downloading, serializing, and uploading these files. In particular, this will greatly benefit Transformers.js users, where bandwidth is important.
Your contribution