Hi there, this may be a stupid question, but I'm confused...
I compiled a corpus of 20GB of raw text and wanted to train my own byte-level BPE (BBPE) tokenizer on it.
Following your NLP course (https://huggingface.co/learn/nlp-course/chapter2/4 — it's friendly and easy to understand, by the way!), I used the same code. It looked fine at first, but the merge step soon became very slow and ran out of memory. The run failed even on servers with 1.5 TB and 2 TB of RAM.
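For reference, here is a minimal sketch of the kind of byte-level BPE training code the course walks through, assuming the `tokenizers` library; the vocab size, special token, and file name below are placeholders, not my exact settings:

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE set up roughly as in the course
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=52000,                     # placeholder vocab size
    special_tokens=["<|endoftext|>"],     # placeholder special token
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)

# The whole corpus is handed over as plain-text files.
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # "corpus.txt" is a placeholder
tokenizer.save("bbpe-tokenizer.json")
```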
From what I've read in some tutorials, BPE training has to keep heavy statistics in memory. But I assume the big players, like OpenAI (with closed models) or Hugging Face (with open-source models), have better solutions, such as multi-node training, for training their own tokenizers on very large corpora (say 500GB of text or much more).
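For what it's worth, here is a sketch of streaming the corpus in batches instead of passing whole files, assuming `train_from_iterator` from the `tokenizers` library (which, as I understand it, accepts batches of strings); I suspect this only lowers peak memory on the text-reading side, while the merge statistics still live in RAM:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def corpus_batches(path, batch_size=10_000):
    """Yield the corpus as small batches of lines so the raw text never sits fully in RAM."""
    batch = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
trainer = trainers.BpeTrainer(
    vocab_size=52000,  # placeholder vocab size
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
)
tokenizer.train_from_iterator(corpus_batches("corpus.txt"), trainer=trainer)  # placeholder path
```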
Would anyone kindly show me how this is done at industrial scale?