
accents, identity and techno-utopianism in the era of dumb ML

Prionti Nasir
7 min read · Jul 29, 2021


Source: David Knickerbocker

0. Context to context

Throughout last year, I worked with the National Technical Institute for the Deaf on a new metric, ACE2, that evaluates the efficacy of automatic speech recognition software. Speech recognition products have traditionally been evaluated using a metric self-explanatorily called the word error rate (WER), but it has been found to have little correlation with caption understandability in human-subject research. Quoting WER’s Wikipedia page: “One problem with using a generic formula such as the one above, however, is that no account is taken of the effect that different types of error may have on the likelihood of successful outcome, e.g. some errors may be more disruptive than others and some may be corrected more easily than others.” (Simply looking at the edit distance between an intended word and the corresponding error does not say much about how successful the speech recognition attempt was in conveying intended meaning to the user??? Crazy.)
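For reference, WER is just a normalized word-level edit distance. A minimal sketch (assuming whitespace tokenization and no text normalization, which real evaluations would add):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / N,
    computed as word-level Levenshtein distance over the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that `wer("i ate an apple", "i ate a bowl")` and a transcript that swaps two function words score identically — the metric has no notion of which errors destroy meaning.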

The new metric is meant to put more focus on error impact, i.e. how much of the speech’s meaning is lost because of an error. For instance, it accounts for the semantic significance of an “apple” -> “a bowl” transformation, as well as the semantic insignificance of a “that” -> “the” transformation.
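To make the idea concrete, here is a toy sketch of impact-weighted error counting. This is NOT the actual ACE2 metric — the weight table is hand-invented for illustration; a real metric would derive impact from semantic models rather than a lookup:

```python
# Hand-assigned impact weights for illustration only (hypothetical values).
# Meaning-preserving errors cost little; meaning-destroying errors cost a lot.
IMPACT = {
    ("that", "the"): 0.1,      # near-interchangeable function words
    ("apple", "a bowl"): 0.9,  # unrelated content words
}

def weighted_error(substitutions: list[tuple[str, str]]) -> float:
    """Sum impact weights over (intended, recognized) pairs.
    Unknown pairs default to full weight 1.0, matching plain WER."""
    return sum(IMPACT.get(pair, 1.0) for pair in substitutions)
```

Under plain WER both substitutions count as one error each; here `("that", "the")` barely registers while `("apple", "a bowl")` nearly counts as a total loss.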

We conducted experiments with both hearing and hard-of-hearing participants, where one person would narrate a script to another through Google Speech-to-Text or IBM Watson. The input script would be revealed at the end, and the person at the receiving end would rate the conversational quality based on how closely the meaning they retained from the captions matched the intended speech. We would then apply the ACE2 and WER metrics to the input and output scripts and compare the metric-generated scores with the participants’ reviews. Generally, the ACE2 metric’s score for an output aligned noticeably more closely with the participants’ reviews, implying that the new metric had a better sense of caption quality, and the researchers returned home happy.

1. Context

There were unsurprising patterns to notice while conducting these experiments. Both Google Speech-to-Text and IBM Watson were especially unsuccessful while…

