
out of memory when training a BBPE tokenizer on a large corpus #1681

Open
yucc-leon opened this issue Nov 14, 2024 · 2 comments

Comments

@yucc-leon

Hi there, this may be a silly question, but I'm confused...
I compiled a corpus of about 20 GB of raw text and wanted to train a custom byte-level BPE (BBPE) tokenizer on it.

Following your NLP course (https://huggingface.co/learn/nlp-course/chapter2/4; it's friendly and easy to understand, by the way!), I used the same code, and it seemed fine at first. But training slowed down and soon ran out of memory during the merge phase.

The program failed even on servers with 1.5 TB or 2 TB of memory.
From some tutorials, I understand that BPE training has to keep heavy pair statistics in memory. But I assume there are better solutions, such as multi-node training, that let organizations like OpenAI (with closed models) or Hugging Face (with open-source models) train their own tokenizers on very large corpora (say 500 GB of text or much more).

Could anyone kindly show me how to do this at an industrial scale?
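
For reference, here is a minimal sketch of the kind of training code I'm running with the `tokenizers` library (the corpus path, vocab size, and special tokens are placeholders, not my exact settings):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE tokenizer
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=50257,                  # placeholder vocab size
    special_tokens=["<|endoftext|>"],  # placeholder special tokens
)

def corpus_iterator(path, batch_size=1000):
    """Stream the corpus in batches of lines instead of loading 20 GB into a list."""
    batch = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

tokenizer.train_from_iterator(corpus_iterator("corpus.txt"), trainer=trainer)
tokenizer.save("bbpe-tokenizer.json")
```

Even though `train_from_iterator` streams the text in batches, the pair statistics built during the merge phase still live in memory, which seems to be where it blows up.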

@ArthurZucker
Collaborator

Hey! I think this is related to #1539 and we are fixing it! 🤗

@ArthurZucker
Collaborator

We are doing a release that includes this in a few days!
