Evaluation Criteria for Conformance Models #34
-
For details of how the metrics from the WAI Symposium were applied to WCAG 3, see Metrics and Plan for Evaluating Conformance Scoring for WCAG 3 from the Conformance Architecture Testing subgroup.
-
@rachaelbradley Can you explain why you think WCAG 3 needs a conformance model that is built on a scoring system? I know there are some difficulties we're hoping to solve with this. I think the big one is that some people have been asking that WCAG 3 allow sites with minor issues to conform. That has gotten push-back both from the "WCAG is a ruler, not the rule" crowd and from the "no exceptions" crowd. Until we have agreement that conformance can ever mean less than meeting 100% of the requirements, I don't think we can make that a requirement for the conformance model.

Second, I'm somewhat skeptical that it's even possible to come up with a scoring system that meets all those criteria, especially the reliable and equitable parts. Let's say we have a scoring of 1-10. Instead of needing to ensure that we have one equitable conformance level, we now have to ensure that we have 10 equitable grades. It gets harder the more granularity you add. And on the reliability front, it gets harder the more you allow testers to decide what is and isn't important / critical / essential for someone with a disability. That feels inherently problematic and ableist to me.

I do appreciate there are significant challenges that come from full conformance being an almost unachievable target for a lot of organizations. That feels much more like a policy problem than a standards problem. Policies should describe to what extent it is okay for things to be imperfect, and what measures an organization should provide to compensate for those shortcomings. There is far more flexibility in that space than there is in WCAG 3's conformance model to consider things like how responsive the help desk is, what non-web alternatives are available, how quickly issues can be resolved, etc.
-
Please be sure to include equity as one of the criteria for evaluating conformance models. As we compare models, we need an approach that supports equity across the disability categories. This is a must.
-
Re complexity/taking "a reasonable amount of time to test": This makes sense as a criterion, but it's also concerning. Are we basically asking how much of the testing can be automated? The Metrics and Plan for Evaluating Conformance Scoring for WCAG 3 that Jeanne shared above says the way to test conformance models for complexity is to "ask experts to run a test on a site where they know how long it took to test with WCAG2 and compare how long it took to test with WCAG3." But one goal of WCAG 3 is to cover more user needs, including more needs of people with cognitive disabilities. Will an emphasis on test time mean we're less likely to cover certain user needs? Or that we'll have to cover them in a less rigorous way, such as an assertion? As we look at the current set of criteria, will equity serve as a balance to testing time? Will each criterion have the same weight?
-
Re: Adequacy
I think this should be "Proportionality" instead. As it currently stands, the title and definition allow a large change at the guideline level to create only a small scoring change. I feel large guideline changes should create large scoring changes, and small guideline changes should create small scoring changes. A toy illustration of the difference is sketched below.
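To make the proportionality concern concrete, here is a minimal sketch comparing a coarse, banded score with a proportional one. Everything here is invented for the illustration (the band thresholds, the 0-3 and 0-10 scales); it is not taken from any draft conformance model.

```python
# Toy comparison: a banded score can absorb a large change at the
# guideline level, while a proportional score tracks it.

def banded_score(pass_rate: float) -> int:
    """Hypothetical coarse score: bands 0-3."""
    if pass_rate >= 0.95:
        return 3
    if pass_rate >= 0.80:
        return 2
    if pass_rate >= 0.50:
        return 1
    return 0

def proportional_score(pass_rate: float) -> float:
    """Hypothetical proportional score on a 0-10 scale."""
    return round(10 * pass_rate, 1)

# A sizable regression (94% -> 81% of instances passing) is invisible
# to the banded model but clearly registers in the proportional one.
for rate in (0.94, 0.81):
    print(rate, banded_score(rate), proportional_score(rate))
# 0.94 -> band 2, proportional 9.4
# 0.81 -> band 2, proportional 8.1
```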
-
We should be aware that past versions of WCAG have not fully managed to support all disabilities equally. These criteria failed at creating equality, and at putting the user first. Therefore we must be careful not to repeat the same error and exclusion. Just because groups were excluded in the past does not mean they should be excluded in the future. Complexity and repeatability? No thank you. Not when compared to the need to meet our core mandate: to make guidelines that tell content creators how to include people with disabilities. To me this is not just about making tests that meet the needs of the testers. Does the content meet the needs of the users? Does it favor some groups over others? Do design choices make the effects of disabilities worse, or even create disability? Equity, equality.
-
I think it's very important to ensure that user testing is recognized as a valid way to test for reliability. There will always be aspects of accessibility that don't fit into a black and white, automated testing scheme -- which is why we should not discount the importance of well-structured user testing. Well-structured user testing follows a measurable protocol that identifies patterns in user feedback. These patterns are what equate to "consistency" and "reproducibility". "Repeatability" does not mean "fast" or "convenient" or "cheap." We need to focus on the intent of the term and recognize that if we are truly concerned about including historically excluded populations in these guidelines, then we will validate user testing as a necessary tool for measuring accessibility for certain populations and certain criteria.
-
Validity, sensitivity, and adequacy as defined above seem not so much separate as interlinked, and dependent both on the granularity of the scale and on the availability of additional conditions captured. In the scoring example cited, this is the "critical error" device, which overrules the arithmetic approach (say, dividing the number of images with appropriate alt by the total number). The mix of instances falling under a particular guideline, each of which can have a different impact when not implemented properly, makes an arithmetic approach dubious.

As a human tester looking at, say, all images on a page, I process both the qualitative aspect, the estimated impact from high (say, a missing name on a critical image-based control) to low (say, bad alt on a teaser image that is followed by a linked teaser heading), and the quantitative aspect (how many images are we talking about here). So I simultaneously process qualitative and quantitative information in determining where to rate it on our 5-point Likert scale (which, by the way, also maps onto WCAG pass/fail, losing granularity by doing so). Obviously there is some subjectivity in that rating, and other evaluators may arrive at a different result. But it saves evaluators going through a detailed (and by that token, complex and time-consuming) process as described in the scoring example. One could argue evaluation would hardly be efficiently doable otherwise, and WCAG 2.X PASS/FAIL raters apply something similar today in deciding whether content is still within the tolerances of "PASS".

Would doing it the complex way, as in the scoring example, improve replicability? Possibly, but at the penalty of being captive to a process that is much more time-consuming than assessing the likely impact of issues on a page and doing that calculation as part of an expert assessment that thankfully need not be explicit to the dot (even though it could be explained and laid out in more detail in any post mortem analysis).
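For readers who want to see the shape of the arithmetic-plus-override approach being discussed, here is a minimal sketch. It is an illustration under stated assumptions, not the cited scoring example: the 5-point scale, the thresholds, and the PASS/FAIL cutoff are all invented for the sake of the example.

```python
# Sketch of an arithmetic rating per guideline, overruled by a
# "critical error" device, then collapsed to pass/fail.

def guideline_rating(passing: int, total: int, critical_errors: int) -> int:
    """Rate one guideline on a 5-point scale (5 = best)."""
    if critical_errors > 0:
        return 1  # the "critical error" device overrules the arithmetic
    ratio = passing / total if total else 1.0
    # map the pass ratio onto the 5-point scale (illustrative cutoffs)
    if ratio >= 0.98:
        return 5
    if ratio >= 0.90:
        return 4
    if ratio >= 0.70:
        return 3
    if ratio >= 0.50:
        return 2
    return 1

def to_pass_fail(rating: int) -> str:
    """Collapse to a WCAG 2.x-style PASS/FAIL, losing granularity."""
    return "PASS" if rating >= 4 else "FAIL"

# 47 of 50 images have appropriate alt, but one missing name on a
# critical image-based control drags the rating to the floor:
print(guideline_rating(47, 50, critical_errors=1))  # 1 -> FAIL
print(guideline_rating(47, 50, critical_errors=0))  # 4 -> PASS
```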
-
When the Silver Task Force created conformance prototypes in what was called phase 3, circa 2018-19, we had and/or created evaluation criteria to compare and discuss each one. I do not recall this list of criteria from a 2011 symposium being among them. Please ensure that those evaluation criteria are also considered. Other criteria that I can imagine:
-
Just a note that these are considerations to discuss when evaluating conformance and not requirements. The purpose is that we understand tradeoffs.

On Jan 18, 2024, Lisa Seeman wrote:
> we run the risk of prioritizing supporting people's business model over user needs. That is what these requirements are heading towards (again)
-
This comment is from the COGA subgroup. As we consider WCAG 3.0, we need to be mindful that equity is foundational to ensuring the representation of historically marginalized groups. In the context of WCAG, equity means ensuring users with any disability or combination of disabilities have a digital experience functionally equivalent to the experience provided to people without disabilities.

As the group responsible for writing WCAG 3.0, we must guard ourselves against falling into subjective measures that can lead to exclusion. For example, arguing that a proposed guideline should meet the needs of a minimum, arbitrary number of people is an ableist bias that has led to the exclusion of certain types of disabilities in previous versions of WCAG. We must be committed to equity for equity's sake. We cannot allow ableist bias to creep into our justifications for accepting or denying guidelines into the levels of WCAG that will be bound to legislation and broadly adopted as the baseline of accessibility by corporations. We must also avoid dismissing the needs of users simply because solutions to address their needs might require testing beyond what can be easily automated.

Accessibility must be similar for different groups of disabilities and different combinations of disabilities at any conformance level. The needs of users must be central to our reasons for including or excluding guidelines and how we rank those guidelines. For the basic accessibility needs of some groups to be covered at Bronze, while those of other groups are relegated to Silver or Gold, would be a disaster, and would simply be repeating history.
-
In some comments the need for something to be testable seems to be interpreted as having to be automatically testable. We have never had a requirement that provisions be automatically testable in order to be included, just testable with inter-evaluator reliable results.
-
Hi, just posting https://w3c.github.io/wcag/conformance-challenges/ here as well. I see related links such as the Research Report, but wanted to post the challenges link here too for reference's sake.
-
When we look across conformance models, it will help to have a set of criteria to use to evaluate and compare the variations.
The proposed criteria from the 2011 WAI Symposium on Accessibility Metrics, as discussed in the comments above, are validity, reliability, sensitivity, adequacy, and complexity.
Additional proposed criteria from AGWG discussion:
Question for Discussion: What evaluation criteria are missing or need to be adjusted?
Note: No conformance model will meet all these criteria. The purpose of this discussion thread is to identify whether there are any other criteria we should be using to compare and discuss each model or whether we should revise/remove any of the ones listed.