[WIP] Add monocleaner #991

Open
wants to merge 11 commits into
base: main

@eu9ene (Collaborator) commented Jan 15, 2025

  • Add Monocleaner to the cleaning pipeline
  • Fix issues with dataset names in cleaning configs

Let's start testing it first, then we can update docs with recommendations based on results.

closes #247
closes #476
closes #985
closes #789

echo "Threshold is 0, skipping filtering"
cp "${output_prefix}.${lang}.rule-based.zst" "${output_prefix}.${lang}.zst"
else
# the model is 125MB, similar in size to the fastText one, so it's ok to download it here
Collaborator Author

@bhearsum do you think it's still worth extracting this download into a separate task? The models are 100-200 MB, similar to the fastText ones. Extracting it would save some ingress traffic, but it would complicate the pipeline and model updates, as it does for bicleaner. The bicleaner models are a lot larger, so caching them is justified.

Collaborator

I have no objections. IMO the more important downside is that not caching these makes you more susceptible to failures whenever their host is down.

(And I could be wrong, but it looks like ingress is free)

@eu9ene (Collaborator, Author) commented Jan 17, 2025

OK, I'll skip it for now then. Let's see if we have issues with that. GitHub is hosting them, so I guess it should be reliable.

https://github.com/bitextor/monocleaner/blob/5a743407179468c83259999987138d6ae590c687/scripts/monocleaner-download#L32
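If the download does turn out to be flaky, retrying it in place might be a cheap mitigation short of moving it to a separate cached task. A minimal sketch, assuming a hypothetical `retry` helper that is not part of this PR; the `monocleaner-download` arguments and the `model_dir` variable in the usage comment are illustrative assumptions as well:

```shell
# Hypothetical helper: rerun a command until it succeeds or the attempt
# budget is exhausted, doubling the wait after each failure.
# Usage: retry <max_attempts> <initial_delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  until "$@"; do
    if [ "${attempt}" -ge "${max_attempts}" ]; then
      echo "Giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    echo "Attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
    sleep "${delay}"
    delay=$(( delay * 2 ))
    attempt=$(( attempt + 1 ))
  done
}

# Illustrative call, arguments assumed rather than taken from the PR:
# retry 5 10 monocleaner-download "${lang}" "${model_dir}"
```

This only smooths over short outages. A longer outage of the host would still fail the task, which is the trade-off against caching discussed above.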

@eu9ene marked this pull request as ready for review January 18, 2025 01:19
@eu9ene requested review from a team as code owners January 18, 2025 01:19
@eu9ene requested review from jcristau and ZJaume January 18, 2025 01:19
2 participants