It seems that the downloader is discarding (or not writing to disk) all sentences, regardless of the threshold. If I run
env PYTHONPATH=. python pipeline/data/download-mono.py --artifacts ./tonto_artifacts --dataset hplt_mono/v1.2 --language en --max_sentences 200000000 --hlpt_min_fluency 0.0
not a single line is written out of the many millions downloaded:
[downloads] Reading lines from: https://data.hplt-project.org/one/monotext/cleaned/en/en_135.jsonl.zst
[downloads] Download size: 15,343,539,325 bytes
[importers.mono] Visited 5,000,000 lines
[importers.mono] Kept 0.
[importers.mono] Wrote 0 out of 200,000,000.
[memory] 59.0 MB
[importers.mono] Visited 10,000,000 lines
[importers.mono] Kept 0.
[importers.mono] Wrote 0 out of 200,000,000.
[memory] 59.0 MB (+0 B)
[importers.mono] Visited 15,000,000 lines
[importers.mono] Kept 5.
[importers.mono] Wrote 0 out of 200,000,000.
[memory] 59.0 MB (+41.0 KB)
...
[importers.mono] Visited 105,000,000 lines
[importers.mono] Kept 905.
[importers.mono] Wrote 0 out of 200,000,000.
[memory] 59.4 MB (-110.6 KB)
[importers.mono] Visited 110,000,000 lines
[importers.mono] Kept 912.
[importers.mono] Wrote 0 out of 200,000,000.
This happened in a task that ran yesterday for Arabic: https://firefox-ci-tc.services.mozilla.com/tasks/UYJgKcM2Qguta80V_VQZDg/runs/0/logs/public/logs/live.log
And it is happening in a task that is running for English right now: https://firefox-ci-tc.services.mozilla.com/tasks/c40gMuBfQk-TS1Tl0By7TQ
It seems that setting --hlpt_max_characters to a non-zero value allows sentences to be written.
Maybe there's a bug. 0 is the default and is supposed to mean that we don't merge sentences into paragraphs. https://github.com/mozilla/translations/pull/901/files
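For reference, here is a minimal sketch of the intended semantics as I understand them, not the actual code in download-mono.py. The function name `merge_sentences`, its signature, and the space-joining are assumptions made up for illustration. The point is that a limit of 0 should pass every sentence through unchanged instead of dropping it, while a positive limit merges sentences up to that many characters.

```python
from typing import Iterable, Iterator, List


def merge_sentences(sentences: Iterable[str], max_characters: int = 0) -> Iterator[str]:
    """Hypothetical helper illustrating the expected behavior of a 0 limit."""
    if max_characters <= 0:
        # 0 means "do not merge": every sentence is emitted as-is.
        yield from sentences
        return

    buffer: List[str] = []
    length = 0
    for sentence in sentences:
        # Count one extra character for the joining space when the buffer is non-empty.
        added = len(sentence) + (1 if buffer else 0)
        if buffer and length + added > max_characters:
            # The next sentence would exceed the limit, so flush the paragraph first.
            yield " ".join(buffer)
            buffer, length = [], 0
            added = len(sentence)
        buffer.append(sentence)
        length += added
    if buffer:
        # Flush whatever is left at the end of the input.
        yield " ".join(buffer)
```

With this shape, a run with the limit left at 0 would write exactly as many lines as were kept, which is what the log above does not show.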
eu9ene