Machine learning & ACT Rules #2113
With machine learning playing an increasingly large role in accessibility testing, it is time for the ACT community to have some conversations about this. The core question I believe is this one: do ACT test cases work with ML? From that, I think a number of topics flow that I would like to see this group discuss.
Do ACT test cases work with ML?

ACT test cases are generally written to be as basic as they possibly can be. Only information necessary for evaluating the test case is included, with an occasional exception to improve the accessibility of the test case. This makes the test cases very unusual. Real web pages do not look like ACT test cases: they have a heading, usually a logo, a navigation bar, etc. "Smart" models can easily be tripped up by the absence of common features like that. That's not just true for visual heuristics either. Our test cases often have only minimal attributes, whereas a real-world page has things like a cursor on non-standard controls, event listeners, and behavior that can be triggered by activating the control.

The unrealistic nature of ACT test cases puts an ML implementation of them in question. If the implementation is not consistent, is that because the test case is "odd", or because the implementation has a flaw? Or the inverse of that: if an ML-based implementation does show consistency, does that mean it will behave correctly in real-world scenarios?
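To illustrate the point about minimal attributes, here is a small hypothetical sketch: a detector that leans on real-world signals such as cursor styles and event listeners will miss a control in a stripped-down test case. The element representation and the signals checked are illustrative only, not taken from any actual tool.

```python
def looks_like_control(element: dict) -> bool:
    """Guess whether an element acts as a control, using signals that
    are common on real pages but usually absent from minimal test cases."""
    return (
        element.get("cursor") == "pointer"
        or bool(element.get("event_listeners"))
        or element.get("role") == "button"
    )

# A typical real-world custom control carries several of these signals:
real_world = {"tag": "span", "cursor": "pointer", "event_listeners": ["click"]}

# A stripped-down ACT test case may omit all of them:
act_test_case = {"tag": "span"}

print(looks_like_control(real_world))     # True
print(looks_like_control(act_test_case))  # False, even if it should be a control
```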
Training on ACT test cases

When training a machine learning model it is standard practice to separate test data from training data. This is done so that when the accuracy of a model is calculated, it is a more realistic reflection of how the model will perform in real-world scenarios, which it presumably hasn't trained on either. Since the test cases are used as a way to measure the consistency of tools with the description of a rule, at first glance it seems reasonable to say that ACT test cases cannot be used to train ML models whose consistency will then be checked with those same test cases.

The other side of that argument, though, is that implementations that do not use machine learning do in practice use the test cases as indicators of where their tools need to be improved. Those improvements are coded up manually, but the test cases nevertheless serve as input data. Arguably, not allowing ACT implementors to use test cases for machine learning puts them at a disadvantage.

Another thing to consider here is that accessibility tools can use models that are trained for a specific website. These models may perform much better on sites they have seen before than on pages they have never seen. One way to look at that is that tools trained on ACT test cases may just fit into a different category from tools that weren't.
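For reference, the train/test separation mentioned above typically looks something like this. This is a minimal sketch assuming a scikit-learn-style workflow; the generated data and classifier choice are placeholders, not part of any ACT implementation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for labeled examples
# (features extracted from pages, outcomes 0 = passed / 1 = failed).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set the model never sees during training, so the
# reported accuracy reflects performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Evaluating on the held-out split is what makes the number meaningful;
# evaluating on the training data would overstate real-world accuracy.
print(accuracy_score(y_test, model.predict(X_test)))
```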
How does confidence fit with ACT consistency?

ACT Rules have so far been written to expect a definitive answer. Something either passes or fails, and if the implementation isn't sure, that gets reported as cantTell. CantTell is, for example, used by tools that are unable to test color contrast on background images. That seems appropriate, as the tool literally has no ability to determine the answer. This is different for tools that report predictions rather than confident answers. A basic example of that is language detection. Language detection can give the most likely language, along with a percentage for how likely it thinks that answer is correct, but that confidence is never 100%.

Reporting everything as "cantTell" on a predictive implementation won't be useful for determining consistency. An alternative could be to let the implementor decide at what confidence level a prediction switches from a cantTell to a fail or pass, either because that is the default of the tool, or because that was the number that worked best for getting rules to be reported as consistent. That may create undesirable differences in how tools are treated, so other options may be to never use cantTell for predictions and always report pass or fail, or for the W3C to decide what the confidence threshold should be when reporting for ACT implementations.

Not all machine learning models are deterministic. A test case may be reported as passing one day, and as failing another. That is especially likely if cantTell cannot be used to report the "less confident" cases. Pushed to the extreme, if given a choice in how to determine confidence, an implementor looking to maximize (and arguably game) their consistency numbers could tune it so that the tool fails everything it should fail, without ever passing anything it shouldn't.

Another consideration here is whether all test cases should be expected to be correct. Predictive results by their nature can be wrong, so it might be appropriate for a margin of error to exist when deciding on the consistency of an implementation, at least in some situations. The downside there, though, is that ACT test cases are fairly minimal, and if one test case is consistently reported incorrectly, that could indicate a potentially significant problem.
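To make the threshold idea concrete, a minimal sketch of mapping a prediction and its confidence to an ACT outcome might look like the following. The outcome names follow the usual passed / failed / cantTell values; the threshold default and the function itself are illustrative, not from any specification:

```python
def outcome_from_prediction(predicted_failed: bool, confidence: float,
                            threshold: float = 0.9) -> str:
    """Map a prediction and its confidence to an ACT outcome.

    Below the threshold the result is reported as cantTell; above it,
    the prediction becomes a definitive pass or fail. Where the threshold
    should sit (tool default? W3C-defined?) is the open question.
    """
    if confidence < threshold:
        return "cantTell"
    return "failed" if predicted_failed else "passed"

# Example: a language detector is 97% sure the detected language does
# not match the declared one -> reported as a definitive fail.
print(outcome_from_prediction(True, 0.97))  # failed
# At 70% confidence the same prediction stays a cantTell.
print(outcome_from_prediction(True, 0.70))  # cantTell
```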