accents, identity and techno-utopianism in the era of dumb ML
0. Context to context
Throughout last year, I worked with the National Technical Institute for the Deaf on a new metric, ACE2, that evaluates the efficacy of automatic speech recognition software. Speech recognition products have traditionally been evaluated using a metric self-explanatorily called the word error rate (WER), but it has been found to have little correlation with caption understandability in human subject research. Quoting WER’s Wikipedia page, “One problem with using a generic formula such as the one above, however, is that no account is taken of the effect that different types of error may have on the likelihood of successful outcome, e.g. some errors may be more disruptive than others and some may be corrected more easily than others.” (Simply looking at the edit distance between an intended word and the corresponding error does not say much about how successful the speech recognition attempt was in conveying intended meaning to the user??? Crazy.)
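To make that concrete, here is a minimal sketch of how WER is typically computed: a word-level edit distance normalized by the length of the reference transcript, in which every substitution, insertion, and deletion costs a flat 1. The Python below is purely for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over word tokens, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution (or match)
    return d[len(ref)][len(hyp)] / len(ref)

# Both hypotheses below get the same WER (one substitution out of four words),
# even though one error garbles the meaning and the other barely matters.
print(word_error_rate("i ate that apple", "i ate that bowl"))  # 0.25
print(word_error_rate("i ate that apple", "i ate the apple"))  # 0.25
```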
The new metric is supposed to put more focus on error impact, i.e. how much meaning is lost from the speech because of an error. For instance, it takes into account the semantic significance of an “apple” -> “a bowl” transformation, as well as the semantic insignificance of a “that” -> “the” transformation.
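I won’t reproduce the actual ACE2 formulation here, but as a toy sketch of the general idea, imagine replacing the flat substitution cost above with one that shrinks when the substituted word is semantically close to the original. The `similarity` function below is a character-overlap placeholder standing in for a real semantic model, so treat this as an illustration of the principle, not the metric itself.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Placeholder for a real semantic similarity model (word embeddings,
    # a language model, ...); character overlap is only for illustration.
    return SequenceMatcher(None, a, b).ratio()

def impact_weighted_error(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0.0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = float(i)
    for j in range(len(hyp) + 1):
        d[0][j] = float(j)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            # Swapping in a near-synonym costs close to 0; swapping in an
            # unrelated word costs close to 1. Insertions/deletions stay at 1.
            sub_cost = 1.0 - similarity(ref[i - 1], hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1.0,
                          d[i][j - 1] + 1.0,
                          d[i - 1][j - 1] + sub_cost)
    return d[len(ref)][len(hyp)] / len(ref)

# The meaning-destroying error is now penalized more heavily than the
# harmless one, unlike under plain WER.
print(impact_weighted_error("i ate that apple", "i ate that bowl"))  # higher
print(impact_weighted_error("i ate that apple", "i ate the apple"))  # lower
```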