Use `Intl.Segmenter` instead of `ssplit` for segmentation in WASM builds #945

nordzilla · 2024-11-26T23:32:48Z

Description

Note

I have two follow-up issues filed for after this patch lands. Please take them into consideration when reviewing.

This pull request fully removes the dependency on ssplit from WASM builds of the inference engine, instead utilizing the Intl.Segmenter implementation within the JavaScript context in which the WASM is being run. This offers a number of benefits.

Correctness Improvements

Intl.Segmenter is created from a specific locale and knows how to segment text according to that language.

ssplit has limited support for CJK segmentation and is a one-size-fits-all segmentation solution that overall works well, but is not robust enough in its current state for all of the languages we want to support.

As an experiment, I cherry-picked the new test cases from this pull request onto the main branch and ran them.

Category	Passed	Failed	Total
Test Files	5	7	12
Test Cases	47	21	68

Note

While the Intl.Segmenter improves correctness for languages that do not utilize whitespace between sentences, I have no reason to believe that it has regressed the preexisting behavior of translating between two languages that utilize whitespace between sentences. As a check, I did a diff of the HTML output of the benchmark page after doing a pivot translation from Spanish to French, and the output was identical both before and after the changes in this PR.

Upstream Updates

By moving to Intl.Segmenter we also get free, standardized updates whenever its algorithms are improved as part of the JavaScript API.

Compiled WASM Binary Size Reduction

By fully removing ssplit as a dependency from the WASM build, we reduce the size of our compiled WASM binary:

File	Before	After
WASM (Uncompressed)	5.01 MB	4.73 MB
WASM (Compressed)	1.78 MB	1.71 MB

Similar Performance

The new implementation actually showed improved performance on the benchmark page. However, this is negligible with a sample size of 1, and is almost certainly within the range of statistical noise.

Nonetheless, I feel encouraged that there is no major or obvious regression coupled with these changes.

Note

Only esfr translation benchmarks were profiled (links in table).
The benchmark metrics in the table related to rates are from distinct runs without the profiler running.

Lang Pair	Sentence Segmentation	Benchmark Time (s)	Words Per Second (wps)	Tokens Per Second (tps)	Max Memory Usage (MB)
`esen`	`ssplit`	4.53	1531.22	2415.62	N/A
`esen`	`Intl.Segmenter`	4.54	1529.19	2412.43	N/A
`esfr`	`ssplit`	9.46	734.11	1158.12	359
`esfr`	`Intl.Segmenter`	9.38	739.58	1166.76	350

gregtatum

I'm leaving my first round of comments, I still need to review 2a50359, but have looked at all the other commits.

inference/src/translator/annotation.h

inference/src/translator/annotation.cpp

inference/BERGAMOT_VERSION

inference/wasm/tests/engine/translations-engine.worker.mjs

inference/src/translator/annotation.cpp

inference/wasm/tests/test-cases/translate-html-multi-paragraph-no-pivot.test.mjs

inference/wasm/tests/models/enzh/lex.50.50.enzh.s2t.bin.gz

inference/wasm/tests/test-cases/translate-html-multi-paragraph-no-pivot.test.mjs

nordzilla · 2024-12-03T00:00:37Z

@gregtatum

Thanks for the review!

I've made changes according to most of your feedback so far (changeset).

Any comments that are not yet resolved have replies.

Happy to keep iterating on this as needed!

gregtatum

Looks good to me! It would be nice to do some performance profiling of the new allocation behavior around the utf8-16 conversions if you haven't already, and also benchmark this on our internal benchmark and maybe a few text heavy wikipedia pages.

inference/src/translator/text_processor.cpp

gregtatum · 2024-12-03T15:42:12Z

Ah I missed the perf numbers at the end. The only potential follow-up would be to run it on a text heavy wikipedia page, but I'm happy with a profile on the benchmark.

This fixes an oversight from a previous PR when `inference-engine` was renamed to `inference`, however the path was not updated in `CODEOWNERS`.

This is a miscellaneous change to the eslint config that now allows different string types based on whether certain types of quotes need to be escaped within the string.

This commit adds a --force-rebuild flag to the WASM build commands that will trigger a rebuild without having to fully clobber and start over.

This commit fixes miscellaneous formatting that I noticed looked misaligned in the terminal. For some reason, some emojis need two spaces after them, when other emojis only need one space to achieve the same alignment.

This commit renames `appendEndingWhitespace` to `handleEndingWhitespace`, because the whitespace logic will be made more complex by this PR, and whitspace is no longer guaranteed to be appended.

This commit adds the capability for several of the C++ classes to register either a source language tag or a target language tag (depending on their needs). I had experimented with changing the constructors themselves, but mtaintaining backward compatibility got messy very quickly with native builds continuing to use `ssplit` and WASM builds now using `Intl.Segmenter`. The least-invasive and cleanest-to-implement compromise that I came up with was to add WASM-specific functionality to register the language tags for classes after construction.

This is the largest commit of the stack, and likely the one to pay the most attention to. In addition to utilizing `Intl.Segmenter` instead of `ssplit` when segmenting text in WASM builds, this patch also necessarily modifies the logic of how whitespce is handled during translations. We now have to concern ourselves with whether the source language and/or target language utilize whitespace between sentences or omit whitespace between sentences. For example: * When translating from Chinese to English, then whitespace must be added between sentences. * When translating from English to Chinese, then whitespace must be removed between sentences. * When translating form Chinese to Japanese, then whitesapce must be inserted between sentences for the English pivot, and then removed for the final output.

This commit entirely removes the build dependency on `ssplit` when building the WASM target. This actually ultimately reduces the size of the compiled WASM binary from 5.01 MB to 4.73 MB.

Part 1 of 2 This commit updates the WASM bindings to take the source language and target language tags in order to construct the TranslationModels that now utilize the locale-specific `Intl.Segmenter`. This effectively takes the `LanguageTranslationModelFiles` object and makes that a sub-object of `TranslationModelPayload`, which includes the language tags as well as the files. This hierarchical separation is ideal, because the `LanguageTranslationModelFiles` object is designed to be iterated over and chunked into aligned memory, where as the language tags are plain strings that are distinctly separate in the way that they are handled.

Part 2 of 2 This commit reworks the TranslationsEngine worker code to utilize the new bindings implemented in the previous commit.

This commit introduces extra logic to the text cleaning that purposely inserts whitespace into CJK text to trick the segmenter into doing the right thing. See the in-code comment for more context.

This commit adds our work-in-progress `zhen` model to the repository for use in testing.

This commit adds several test cases for translating from Chinese into other languages, which will both guard against regressions and demonstrate correct segmentation behavior.

Part 1 of 2 The final two commits of this stack may be slightly controversial. We do not currently have a viable `enzh` model, even for testing purposes, however, I need to test the functionality of removing whitespace between sentences for target languages that require it. This patch adds our `enes` models under the `enzh` directory, which will trick the implementation into translating into "Chinese" with a Spanish output. The key difference is that the Spanish output should not include spaces between sentences, which is, in my opinion, good enough for testing in the interim.

Part 2 of 2 This patch adds test cases for translating into "Chinese", which at present, is actually a Spanish translation that omits spaces between sentences.

ZJaume · 2024-12-10T11:26:12Z

I was able to run our sentence splitting benchmark for ICU4X, see "2022 (mixed) avg" sheet for a summary and "2022 (mixed)" sheet for details on each tested language: https://docs.google.com/spreadsheets/d/1mGJ9MSyMlsK0EUDRC2J50uxApiti3ggnlrzAWn8rkMg/edit?usp=sharing

Not the best, but also no significant flaws.

nordzilla · 2024-12-10T14:26:52Z

I was able to run our sentence splitting benchmark for ICU4X, see "2022 (mixed) avg" sheet for a summary and "2022 (mixed)" sheet for details on each tested language: https://docs.google.com/spreadsheets/d/1mGJ9MSyMlsK0EUDRC2J50uxApiti3ggnlrzAWn8rkMg/edit?usp=sharing

Not the best, but also no significant flaws.

This is really great @ZJaume, thank you for putting this together.

…lds (#945) * Fix inference in CODEOWNERS file This fixes an oversight from a previous PR when `inference-engine` was renamed to `inference`, however the path was not updated in `CODEOWNERS`. * Improve eslint string-formatting configuration This is a miscellaneous change to the eslint config that now allows different string types based on whether certain types of quotes need to be escaped within the string. * Add a --force-rebuild flag to WASM build commands This commit adds a --force-rebuild flag to the WASM build commands that will trigger a rebuild without having to fully clobber and start over. * Fix misc. formatting in build-bergamot.py This commit fixes miscellaneous formatting that I noticed looked misaligned in the terminal. For some reason, some emojis need two spaces after them, when other emojis only need one space to achieve the same alignment. * Rename `appendEndingWhitespace` This commit renames `appendEndingWhitespace` to `handleEndingWhitespace`, because the whitespace logic will be made more complex by this PR, and whitspace is no longer guaranteed to be appended. * Add capability to register languages This commit adds the capability for several of the C++ classes to register either a source language tag or a target language tag (depending on their needs). I had experimented with changing the constructors themselves, but mtaintaining backward compatibility got messy very quickly with native builds continuing to use `ssplit` and WASM builds now using `Intl.Segmenter`. The least-invasive and cleanest-to-implement compromise that I came up with was to add WASM-specific functionality to register the language tags for classes after construction. * Implement WASM segmentation with `Intl.Segmenter` This is the largest commit of the stack, and likely the one to pay the most attention to. In addition to utilizing `Intl.Segmenter` instead of `ssplit` when segmenting text in WASM builds, this patch also necessarily modifies the logic of how whitespce is handled during translations. We now have to concern ourselves with whether the source language and/or target language utilize whitespace between sentences or omit whitespace between sentences. For example: * When translating from Chinese to English, then whitespace must be added between sentences. * When translating from English to Chinese, then whitespace must be removed between sentences. * When translating form Chinese to Japanese, then whitesapce must be inserted between sentences for the English pivot, and then removed for the final output. * Remove WASM dependency on ssplit This commit entirely removes the build dependency on `ssplit` when building the WASM target. This actually ultimately reduces the size of the compiled WASM binary from 5.01 MB to 4.73 MB. * Bump Bergamot Version 0.4.5 => 0.5.0 * Update WASM Bindings Part 1 of 2 This commit updates the WASM bindings to take the source language and target language tags in order to construct the TranslationModels that now utilize the locale-specific `Intl.Segmenter`. This effectively takes the `LanguageTranslationModelFiles` object and makes that a sub-object of `TranslationModelPayload`, which includes the language tags as well as the files. This hierarchical separation is ideal, because the `LanguageTranslationModelFiles` object is designed to be iterated over and chunked into aligned memory, where as the language tags are plain strings that are distinctly separate in the way that they are handled. * Rework TranslationsEngine to utilize new bindings Part 2 of 2 This commit reworks the TranslationsEngine worker code to utilize the new bindings implemented in the previous commit. * Insert whitespace between full-width punctuation and opening quotes This commit introduces extra logic to the text cleaning that purposely inserts whitespace into CJK text to trick the segmenter into doing the right thing. See the in-code comment for more context. * Add `zhen` test model files This commit adds our work-in-progress `zhen` model to the repository for use in testing. * Add test cases for testing `zhen` models. This commit adds several test cases for translating from Chinese into other languages, which will both guard against regressions and demonstrate correct segmentation behavior. * Add temporary `enzh` models for testing Part 1 of 2 The final two commits of this stack may be slightly controversial. We do not currently have a viable `enzh` model, even for testing purposes, however, I need to test the functionality of removing whitespace between sentences for target languages that require it. This patch adds our `enes` models under the `enzh` directory, which will trick the implementation into translating into "Chinese" with a Spanish output. The key difference is that the Spanish output should not include spaces between sentences, which is, in my opinion, good enough for testing in the interim. * Add makeshift `enzh` tests Part 2 of 2 This patch adds test cases for translating into "Chinese", which at present, is actually a Spanish translation that omits spaces between sentences.

nordzilla force-pushed the wasm-intl-segmenter branch from 196c76b to eef64a9 Compare November 26, 2024 23:50

nordzilla self-assigned this Nov 26, 2024

nordzilla added the inference label Nov 26, 2024

nordzilla marked this pull request as ready for review November 27, 2024 01:22

nordzilla requested review from a team as code owners November 27, 2024 01:22

nordzilla requested a review from gregtatum November 27, 2024 01:22

nordzilla force-pushed the wasm-intl-segmenter branch 15 times, most recently from 1caa82e to bbf2551 Compare November 27, 2024 21:09

gregtatum reviewed Dec 2, 2024

View reviewed changes

gregtatum self-requested a review December 2, 2024 21:13

nordzilla mentioned this pull request Dec 2, 2024

Investigate better ways to determine which language scripts require or omit sentence-separating whitespace #950

Open

nordzilla force-pushed the wasm-intl-segmenter branch from bbf2551 to 86f4bab Compare December 2, 2024 23:58

nordzilla force-pushed the wasm-intl-segmenter branch from 86f4bab to 428ab1a Compare December 3, 2024 00:04

gregtatum approved these changes Dec 3, 2024

View reviewed changes

inference/src/translator/text_processor.cpp Show resolved Hide resolved

inference/src/translator/text_processor.cpp Show resolved Hide resolved

inference/src/translator/text_processor.cpp Show resolved Hide resolved

inference/src/translator/text_processor.cpp Outdated Show resolved Hide resolved

nordzilla added 16 commits December 3, 2024 10:18

Fix inference in CODEOWNERS file

bc809ea

This fixes an oversight from a previous PR when `inference-engine` was renamed to `inference`, however the path was not updated in `CODEOWNERS`.

Improve eslint string-formatting configuration

35698f7

This is a miscellaneous change to the eslint config that now allows different string types based on whether certain types of quotes need to be escaped within the string.

Add a --force-rebuild flag to WASM build commands

cc11a1c

This commit adds a --force-rebuild flag to the WASM build commands that will trigger a rebuild without having to fully clobber and start over.

Fix misc. formatting in build-bergamot.py

d12bcd8

This commit fixes miscellaneous formatting that I noticed looked misaligned in the terminal. For some reason, some emojis need two spaces after them, when other emojis only need one space to achieve the same alignment.

Rename appendEndingWhitespace

4a6acd4

This commit renames `appendEndingWhitespace` to `handleEndingWhitespace`, because the whitespace logic will be made more complex by this PR, and whitspace is no longer guaranteed to be appended.

Remove WASM dependency on ssplit

551efad

This commit entirely removes the build dependency on `ssplit` when building the WASM target. This actually ultimately reduces the size of the compiled WASM binary from 5.01 MB to 4.73 MB.

Bump Bergamot Version 0.4.5 => 0.5.0

bb481f5

Rework TranslationsEngine to utilize new bindings

6033192

Part 2 of 2 This commit reworks the TranslationsEngine worker code to utilize the new bindings implemented in the previous commit.

Insert whitespace between full-width punctuation and opening quotes

b455c2d

This commit introduces extra logic to the text cleaning that purposely inserts whitespace into CJK text to trick the segmenter into doing the right thing. See the in-code comment for more context.

Add zhen test model files

e01b2b6

This commit adds our work-in-progress `zhen` model to the repository for use in testing.

Add test cases for testing zhen models.

44ab258

This commit adds several test cases for translating from Chinese into other languages, which will both guard against regressions and demonstrate correct segmentation behavior.

Add makeshift enzh tests

63c4022

Part 2 of 2 This patch adds test cases for translating into "Chinese", which at present, is actually a Spanish translation that omits spaces between sentences.

nordzilla force-pushed the wasm-intl-segmenter branch from 428ab1a to 63c4022 Compare December 3, 2024 16:18

nordzilla merged commit f901047 into main Dec 3, 2024
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use `Intl.Segmenter` instead of `ssplit` for segmentation in WASM builds #945

Use `Intl.Segmenter` instead of `ssplit` for segmentation in WASM builds #945

nordzilla commented Nov 26, 2024 •

edited

Loading

gregtatum left a comment •

edited by nordzilla

Loading

nordzilla commented Dec 3, 2024 •

edited

Loading

gregtatum left a comment

gregtatum commented Dec 3, 2024

ZJaume commented Dec 10, 2024

nordzilla commented Dec 10, 2024

Use Intl.Segmenter instead of ssplit for segmentation in WASM builds #945

Use Intl.Segmenter instead of ssplit for segmentation in WASM builds #945

Conversation

nordzilla commented Nov 26, 2024 • edited Loading

Description

Correctness Improvements

Upstream Updates

Compiled WASM Binary Size Reduction

Similar Performance

gregtatum left a comment • edited by nordzilla Loading

Choose a reason for hiding this comment

nordzilla commented Dec 3, 2024 • edited Loading

gregtatum left a comment

Choose a reason for hiding this comment

gregtatum commented Dec 3, 2024

ZJaume commented Dec 10, 2024

nordzilla commented Dec 10, 2024

Use `Intl.Segmenter` instead of `ssplit` for segmentation in WASM builds #945

Use `Intl.Segmenter` instead of `ssplit` for segmentation in WASM builds #945

nordzilla commented Nov 26, 2024 •

edited

Loading

gregtatum left a comment •

edited by nordzilla

Loading

nordzilla commented Dec 3, 2024 •

edited

Loading