
out of memory when training a BBPE tokenizer on a large corpus #1681

Open
yucc-leon opened this issue Nov 14, 2024 · 2 comments

Comments

@yucc-leon

Hi there, this may be a silly question, but I'm confused...
I compiled a corpus of about 20 GB of raw text and wanted to train a custom byte-level BPE (BBPE) tokenizer on it.

Following your NLP course (https://huggingface.co/learn/nlp-course/chapter2/4; it's friendly and easy to understand, by the way!), I used the same code, and it seemed fine at first. But training slowed down and soon ran out of memory during the merge phase.

The program failed even on servers with 1.5 TB or 2 TB of memory.
From some tutorials, I understand that BPE training has to keep heavy pair statistics in memory. But I assume there are better solutions, such as multi-node training, that let organizations like OpenAI (with closed models) or Hugging Face (with open-source models) train their own tokenizers on very large corpora (say 500 GB of text or much more).

Could anyone kindly show me how to do this at an industrial scale?
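
For reference, here is a minimal sketch of the kind of training code I'm running with the `tokenizers` library (the corpus path, vocab size, and special tokens are placeholders, not my exact settings):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50257,                  # placeholder vocab size
    special_tokens=["<|endoftext|>"],  # placeholder special tokens
)

def corpus_iterator(path, batch_size=1000):
    """Stream the corpus in batches of lines instead of loading 20 GB into a list."""
    batch = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

tokenizer.train_from_iterator(corpus_iterator("corpus.txt"), trainer=trainer)
tokenizer.save("bbpe-tokenizer.json")
```

Even though `train_from_iterator` streams the text in batches, the pair statistics built during the merge phase still live in memory, which seems to be where it blows up.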

@ArthurZucker
Collaborator

Hey! I think this is related to #1539 and we are fixing it! 🤗

@ArthurZucker
Collaborator

We are doing a release that includes this in a few days!
