Machine learning & ACT Rules #2113
With machine learning playing an increasingly large role in accessibility testing, it is time for the ACT community to have some conversations about this. The core question I believe is this one: do ACT test cases work with ML? From that, I think a number of topics flow that I would like to see this group discuss.
Do ACT test cases work with ML?

ACT test cases are generally written to be as basic as they possibly can be. Only information necessary for evaluating the test case is included, with an occasional exception to improve the accessibility of the test case. This makes the test cases very unusual. Real web pages do not look like ACT test cases: they have a heading, usually a logo, a navigation bar, etc. "Smart" models can easily be tripped up by the absence of common features like that. That's not just true for visual heuristics either. Our test cases often have only minimal attributes, whereas a real-world page has things like a cursor on non-standard controls, event listeners, and behavior that can be triggered by activating the control.

The unrealistic nature of ACT test cases puts an ML implementation of them in question. If the implementation is not consistent, is that because the test case is "odd", or because the implementation has a flaw? Or the inverse of that: if an ML-based implementation does show consistency, does that mean it will behave correctly in real-world scenarios?
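To illustrate the point about minimal attributes, here is a small hypothetical sketch: a detector that leans on real-world signals such as cursor styles and event listeners will miss a control in a stripped-down test case. The element representation and the signals checked are illustrative only, not taken from any actual tool.

```python
def looks_like_control(element: dict) -> bool:
    """Guess whether an element acts as a control, using signals that
    are common on real pages but usually absent from minimal test cases."""
    return (
        element.get("cursor") == "pointer"
        or bool(element.get("event_listeners"))
        or element.get("role") == "button"
    )

# A typical real-world custom control carries several of these signals:
real_world = {"tag": "span", "cursor": "pointer", "event_listeners": ["click"]}

# A stripped-down ACT test case may omit all of them:
act_test_case = {"tag": "span"}

print(looks_like_control(real_world))     # True
print(looks_like_control(act_test_case))  # False, even if it should be a control
```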
Training on ACT test cases

When training a machine learning model it is standard practice to separate test data from training data. This is done so that when the accuracy of a model is calculated, it is a more realistic reflection of how the model will perform in real-world scenarios, which it presumably hasn't trained on either. Since the test cases are used as a way to measure the consistency of tools with the description of a rule, at first glance it seems reasonable to say that ACT test cases cannot be used to train ML models whose consistency will then be checked with those same test cases.

The other side of that argument, though, is that implementations that do not use machine learning do in practice use the test cases as indicators of where their tools need to be improved. Those improvements are coded up manually, but the test cases nevertheless serve as input data. Arguably, not allowing ACT implementors to use test cases for machine learning puts them at a disadvantage.

Another thing to consider here is that accessibility tools can use models that are trained for a specific website. These models may perform much better on sites they have seen before than on pages they have never seen. One way to look at that is that tools trained on ACT test cases may just fit into a different category from tools that weren't.
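For reference, the train/test separation mentioned above typically looks something like this. This is a minimal sketch assuming a scikit-learn-style workflow; the generated data and classifier choice are placeholders, not part of any ACT implementation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholder data standing in for labeled examples
# (features extracted from pages, outcomes 0 = passed / 1 = failed).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out a test set the model never sees during training, so the
# reported accuracy reflects performance on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Evaluating on the held-out split is what makes the number meaningful;
# evaluating on the training data would overstate real-world accuracy.
print(accuracy_score(y_test, model.predict(X_test)))
```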
How does confidence fit with ACT consistency?

ACT Rules have so far been written to expect a definitive answer. Something either passes or fails, and if the implementation isn't sure, that gets reported as cantTell. CantTell is, for example, used by tools that are unable to test color contrast on background images. That seems appropriate, as the tool literally has no ability to determine the answer. This is different for tools that report predictions rather than confident answers. A basic example of that is language detection. Language detection can give the most likely language, along with a percentage for how likely it thinks that answer is correct, but that confidence is never 100%.

Reporting everything as "cantTell" on a predictive implementation won't be useful for determining consistency. An alternative could be to let the implementor decide at what confidence level a prediction switches from a cantTell to a fail or pass, either because that is the default of the tool, or because that was the number that worked best for getting rules to be reported as consistent. That may create undesirable differences in how tools are treated, so other options may be to never use cantTell for predictions and always report pass or fail, or for the W3C to decide what the confidence threshold should be when reporting for ACT implementations.

Not all machine learning models are deterministic. A test case may be reported as passing one day, and as failing another. That is especially likely if cantTell cannot be used to report the "less confident" cases. Pushed to the extreme, if given a choice in how to determine confidence, an implementor looking to maximize (and arguably game) their consistency numbers could tune it so that the tool fails everything it should fail, without ever passing anything it shouldn't.

Another consideration here is whether all test cases should be expected to be correct. Predictive results by their nature can be wrong, so it might be appropriate for a margin of error to exist when deciding on the consistency of an implementation, at least in some situations. The downside there, though, is that ACT test cases are fairly minimal, and if one test case is consistently reported incorrectly, that could indicate a potentially significant problem.
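To make the threshold idea concrete, a minimal sketch of mapping a prediction and its confidence to an ACT outcome might look like the following. The outcome names follow the usual passed / failed / cantTell values; the threshold default and the function itself are illustrative, not from any specification:

```python
def outcome_from_prediction(predicted_failed: bool, confidence: float,
                            threshold: float = 0.9) -> str:
    """Map a prediction and its confidence to an ACT outcome.

    Below the threshold the result is reported as cantTell; above it,
    the prediction becomes a definitive pass or fail. Where the threshold
    should sit (tool default? W3C-defined?) is the open question.
    """
    if confidence < threshold:
        return "cantTell"
    return "failed" if predicted_failed else "passed"

# Example: a language detector is 97% sure the detected language does
# not match the declared one -> reported as a definitive fail.
print(outcome_from_prediction(True, 0.97))  # failed
# At 70% confidence the same prediction stays a cantTell.
print(outcome_from_prediction(True, 0.70))  # cantTell
```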