[WIP] Add monocleaner #991
base: main
    echo "Threshold is 0, skipping filtering"
    cp "${output_prefix}.${lang}.rule-based.zst" "${output_prefix}.${lang}.zst"
else
    # the model is 125MB, similar in size to the fastText one, so it's ok to download it here
@bhearsum do you think it's still worth extracting this download to a separate task? The models are 100-200 MB, similar in size to the fastText ones. Extracting the download would save some ingress traffic, but it would complicate the pipeline and model updates, as it does for the bicleaner models. The bicleaner models are a lot larger, so caching them is justified.
I have no objections. IMO the more important downside is that not caching these means you'll be more susceptible to failures whenever the service hosting them is down.
(And I could be wrong, but it looks like ingress is free)
Ok, I'll skip it for now then. Let's see if we have issues with that. GitHub is hosting them, so I guess it should be reliable.
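If hosting outages do turn out to be a problem, one lightweight mitigation short of a separate cached task would be to retry the download a few times. A minimal sketch, assuming a hypothetical `retry` helper (the name, attempt count and delay are illustrative and not part of this PR):

```shell
# Hypothetical helper, not taken from the PR: run a command up to
# <attempts> times, sleeping <delay> seconds between failed attempts.
# The body runs in a subshell so its variables do not leak to the caller.
retry() (
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    # Stop at the first successful run of the wrapped command.
    if "$@"; then
      return 0
    fi
    echo "Attempt ${i}/${attempts} failed: $*" >&2
    # Only pause if another attempt is still to come.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
)
```

It could wrap the model download, for example `retry 3 10 curl --fail --location --output "${model_archive}" "${model_url}"`, where `model_archive` and `model_url` are placeholder variables rather than names from the script. This only covers short outages; a longer one would still need the cached-task approach discussed above.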
Let's start testing it first, then we can update docs with recommendations based on results.
closes #247
closes #476
closes #985
closes #789