HPLT downloader is not keeping any sentence #995

Open
ZJaume opened this issue Jan 16, 2025 · 2 comments · May be fixed by #997

@ZJaume
Collaborator

ZJaume commented Jan 16, 2025

The downloader seems to discard (or fail to write to disk) every sentence, regardless of the threshold. If I run

env PYTHONPATH=. python pipeline/data/download-mono.py --artifacts ./tonto_artifacts --dataset hplt_mono/v1.2 --language en --max_sentences 200000000 --hlpt_min_fluency 0.0

not a single line is preserved out of the many millions downloaded:

[downloads] Reading lines from: https://data.hplt-project.org/one/monotext/cleaned/en/en_135.jsonl.zst
[downloads] Download size: 15,343,539,325 bytes
[importers.mono] Visited 5,000,000 lines
[importers.mono] Kept 0.
[importers.mono] Wrote 0 out of 200,000,000.
[memory] 59.0 MB
[importers.mono] Visited 10,000,000 lines
[importers.mono] Kept 0.
[importers.mono] Wrote 0 out of 200,000,000.
[memory] 59.0 MB (+0 B)
[importers.mono] Visited 15,000,000 lines
[importers.mono] Kept 5.
[importers.mono] Wrote 0 out of 200,000,000.
[memory] 59.0 MB (+41.0 KB)
...
[importers.mono] Visited 105,000,000 lines
[importers.mono] Kept 905.
[importers.mono] Wrote 0 out of 200,000,000.
[memory] 59.4 MB (-110.6 KB)
[importers.mono] Visited 110,000,000 lines
[importers.mono] Kept 912.
[importers.mono] Wrote 0 out of 200,000,000.

This happened in a task that ran yesterday for Arabic: https://firefox-ci-tc.services.mozilla.com/tasks/UYJgKcM2Qguta80V_VQZDg/runs/0/logs/public/logs/live.log
It is also happening in a task that is running for English right now: https://firefox-ci-tc.services.mozilla.com/tasks/c40gMuBfQk-TS1Tl0By7TQ

@ZJaume
Collaborator Author

ZJaume commented Jan 16, 2025

It seems that setting --hlpt_max_characters to a non-zero value allows sentences to be written.

@eu9ene
Collaborator

eu9ene commented Jan 16, 2025

Maybe there's a bug. 0 is the default and is supposed to mean that we don't merge sentences into paragraphs. https://github.com/mozilla/translations/pull/901/files
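
To illustrate the "0 means no merging" convention described above, here is a minimal sketch. It is not the code in pipeline/data: the function name `merge_sentences` and its signature are hypothetical, and whether the real importer fails this way is not established in this thread. The sketch only shows one way a 0 limit could end up with nothing written (every candidate is longer than 0 characters, so nothing ever fits) and how an explicit check for the sentinel avoids that.

```python
from typing import Iterable, Iterator


def merge_sentences(sentences: Iterable[str], max_characters: int = 0) -> Iterator[str]:
    """Hypothetical helper, not the actual importer code.

    max_characters == 0 is treated as "merging disabled": every sentence is
    yielded unchanged. A positive value packs consecutive sentences into
    chunks of at most that many characters.
    """
    if max_characters <= 0:
        # Handle the sentinel explicitly. Without this branch, the length
        # check below would reject every non-empty sentence when the limit
        # is 0, and nothing would ever be yielded.
        yield from sentences
        return

    chunk = ""
    for sentence in sentences:
        candidate = f"{chunk} {sentence}" if chunk else sentence
        if len(candidate) <= max_characters:
            chunk = candidate
            continue
        if chunk:
            yield chunk
        # Start a new chunk; a single sentence longer than the limit is skipped.
        chunk = sentence if len(sentence) <= max_characters else ""
    if chunk:
        yield chunk
```

With this sketch, `merge_sentences(lines, 0)` passes every line through untouched, while a positive limit groups neighbouring lines into chunks no longer than the limit.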

@eu9ene eu9ene added the bug Something is broken or not correct label Jan 18, 2025
@eu9ene eu9ene self-assigned this Jan 18, 2025
@eu9ene eu9ene linked a pull request Jan 18, 2025 that will close this issue
2 participants