The Promise and Pitfalls of China's AI Sign Language Interpreters


In 2022, Chinese television viewers who tuned into the Beijing Winter Olympics and Paralympics likely noticed something different about the coverage. For the first time, several major TV channels had digital avatars providing real-time sign language translation for their live broadcasts. The following month, Alibaba’s DAMO Academy officially launched a sign language avatar called Xiaomo to serve the 4th Asian Para Games, held in the eastern city of Hangzhou. And at year’s end, telecom giant China Mobile and Migu Video jointly launched their own avatar to provide sign language commentary for the 2022 World Cup in Qatar.

This sudden surge in sign language avatars came in response to the Chinese government’s call for improving access for people with disabilities through technology. While the 2008 Law on the Protection of Persons with Disabilities clearly states that public service institutions and public places should “provide disabled persons with voice and word hints, sign language, Braille and other information communication services,” implementation had long been spotty, in part due to a shortage of interpreters.

There are 20.5 million people in China with hearing disabilities, according to nationwide survey data, many of whom communicate using sign language. Yet only a handful of universities and colleges in the country offer sign language interpreting majors, leaving gaps in a number of fields from medicine to law and education.

To address the shortfall, in 2020, China's National Radio and Television Administration proposed the development of a technical plan for sign language avatars and the promotion of virtual anchors, sign language animation, and other technologies in news reports, variety shows, weather forecasts, and educational programs.

Interestingly, however, these avatars have been best received by viewers outside the Deaf community. While media and online commentators see the development as a rare example of technology being a force for good, many Deaf Chinese and sign language interpreters have been either skeptical of or outright resistant to the avatars, complaining that they are difficult to understand, use stiff movements, lack expression, and occasionally fall into the uncanny valley.

To better understand these concerns, my research team and I investigated the use of sign language avatars at the Beijing Winter Olympics from the perspectives of both professionals and ordinary users. We transcribed and back-translated the sign language created by the avatars, then compared the results with the original audio, finding that a significant amount of key information was lost or distorted in the AI-generated version.

Deaf people reported having difficulty accurately understanding the signing regardless of whether they watched the video directly or read the transcribed records. On closer inspection, the movements of the avatars differed considerably from everyday sign language in terms of hand shape, position, direction, and movement. Other issues were even more prominent: the avatars' facial expressions and body language were off, and their mouth movements were distorted. The users we interviewed said that they generally couldn't understand the avatars' movements, and noted that the avatars seemed to have a limited vocabulary and struggled to handle words with multiple meanings.

Many AI products are already able to complete translation and broadcasting tasks with a high degree of accuracy. So why have sign language avatars performed so poorly? Partly, it's due to the complexity of sign language translation, but the avatars' struggles also reflect blind spots in the developers' approach and the absence of Deaf perspectives.

Technical difficulties

The first and most crucial issue that developers of sign language avatars overlook is the difference between signed and spoken language. In particular, many perceive sign language as an accessory to spoken language, or else believe that translating between the two is similar to translating between two spoken languages.

But the modalities of spoken language and sign language are quite different. The former is an oral-auditory language, while the latter is a visual-gestural or visual-spatial language. The term “gestural” is a relatively broad concept that includes not just hand movements, but also facial expressions and body language. Full utilization of the body in space allows sign language users to express the meaning of an entire sentence — such as “a person walks into a room” — with just one action.

This means that sign language and spoken language differ in terms of both vocabulary and grammar. According to the researcher Wu Ling, no corresponding Chinese words can be found for the meanings expressed by 50% of the gestures in Chinese Sign Language. Indeed, tests of the major domestic sign language avatars conducted by my research team suggest that these products perform well when expressing sentences that have similar word orders in sign language and spoken Chinese, but struggle with simple sentences that involve spatial orientation or simultaneity, such as “fish are swimming in water.”

Another obstacle facing developers of sign language avatars is the diversity of sign language. Even within Chinese Sign Language, there are not only different dialects, but also distinctions between “natural sign language,” which originates from the daily lives of Deaf people, and “signed Chinese language,” which is an expression of Chinese characters using signs. The language used by most Chinese Deaf people lies somewhere on a spectrum between the two. To ensure that sign language avatars can be understood by as many users as possible, it’s therefore necessary to strike a balance between the different styles.

This is hard to achieve due to the lack of a key element: data. Because the sign language corpora that companies can draw on are limited, current commercial solutions generally adopt a two-step approach: speech is first transcribed into text, and the text is then converted into gestures. This approach reduces costs significantly, but it essentially treats sign language as a linear language, losing its inherent simultaneity and spatiality and deviating from the language's true nature.
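To make the linearity problem concrete, here is a minimal sketch of such a two-step pipeline in Python. All names and data structures are hypothetical illustrations, not any vendor's actual system; the point is simply that mapping words to signs one at a time produces a flat sequence that cannot encode simultaneity or spatial placement.

```python
# A minimal, hypothetical sketch of the two-step pipeline described above.
# Function and variable names are illustrative assumptions, not any vendor's
# actual API; the point is that a word-by-word mapping yields a flat sequence.

from dataclasses import dataclass

@dataclass
class GestureClip:
    gloss: str        # label of the sign, e.g. "FISH"
    duration_ms: int  # playback length of the animation segment

def text_to_glosses(sentence: str, lexicon: dict) -> list:
    """Step 1: map each word of the transcribed text to a sign gloss.

    The output is a linear list, so there is no way to represent two signs
    produced at the same time or the spatial placement of referents.
    """
    return [lexicon.get(word, f"FINGERSPELL:{word}") for word in sentence.lower().split()]

def glosses_to_animation(glosses: list, clip_bank: dict) -> list:
    """Step 2: concatenate one stored animation clip per gloss into a timeline."""
    return [clip_bank[g] for g in glosses if g in clip_bank]

if __name__ == "__main__":
    lexicon = {"fish": "FISH", "swim": "SWIM", "water": "WATER"}
    clip_bank = {g: GestureClip(g, 800) for g in lexicon.values()}
    glosses = text_to_glosses("fish swim water", lexicon)
    print(glosses)                                  # ['FISH', 'SWIM', 'WATER']
    print(glosses_to_animation(glosses, clip_bank))  # three clips played back to back
```

A fluent signer would instead place the water in signing space and show the fish moving through it as a single simultaneous action, which a clip-by-clip timeline like this cannot represent.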

Unheard feedback

The various technical issues above point to real problems in the development process, but even if they are addressed, avatars will be unable to effectively serve the Deaf community unless the opinions and perspectives of Deaf people are integrated into the entire research, development, and application process.

As far back as 2001, the French scholar Annelies Braffort pointed out that the vast differences in how spoken and signed languages function mean that, unless computer scientists cooperate closely with sign language linguists and Deaf people, they cannot ensure that their sign language translation programs are effective or ethical. This form of cooperation remains important in the AI era, especially when sign language corpora are insufficient or of poor quality. It is all the more necessary for sign language linguists, and Deaf linguists in particular, to participate closely in the research and development process, contributing their specialized knowledge to help compensate for the insufficient data available to developers.

Unfortunately, most tech companies currently developing sign language avatars have not involved sign language linguists or Deaf people in any great depth. Even in the cases where sign language teachers or interpreters are included, developers often only slot them into supporting roles, instead of taking the opinions of Deaf users as the final arbiter of their products’ effectiveness.

As a former consultant in the development of a sign language avatar by a major Chinese tech company, I have personal experience with this approach. The team was clearly passionate about improving accessibility, but they seemed to underestimate the difficulty involved, overestimate the power of technology to solve problems, and lack the necessary experience, resources, and ability to judge the quality of work done by third-party companies. By the time I joined the project, these shortcomings had already become apparent. Although the development team welcomed my participation, I felt that this respect was more for my technical knowledge as a university professor than for my identity as a Deaf person. This is probably also why, when I pointed out that the product wouldn't meet users' expectations, my feedback was not fully embraced: the developers seemed unable to fully empathize with my frustrations.

There are also fundamental issues with the way the tech industry approaches the problem of sign language translation. Tech companies are used to launching a first version riddled with bugs, then optimizing it based on large amounts of user feedback. However, when a product that many Deaf people have described as “incomprehensible” is hastily released under the guise of technological empowerment, it actually damages the Deaf community's faith in technological solutions. That's not to mention that some companies mislead users by featuring real human signers rather than their avatars in promotional materials, only to release an immature generative AI version. Techno-optimists may believe that these flaws will all be solved with time, but we shouldn't ignore the irreversible ethical harm: if the real needs of Deaf users are not responded to, they'll feel that they're being treated as guinea pigs.

It’s true that AI and ethical risks have always gone hand in hand, but these risks are especially prominent in sign language avatars. The worrying quality of AI-generated sign language directly infringes upon Deaf people’s right to access information, pollutes the sign language corpus, and hinders the promotion and popularization of genuine sign language among the Deaf community.

“Nothing about us without us” was the motto of Disabled People’s International when it was founded in 1981, and the warning it offers is as relevant today as it was then. Responsible developers need to do much more than hire a few token Deaf people. They must integrate our perspectives into the entire process of R&D, evaluation, application, and supervision — only then can avatar technology truly serve Deaf people and gain our trust.

Translator: David Ball; editor: Cai Yineng.

(Header image: A screenshot shows a sign language avatar translating a presenter’s speech during the Beijing Winter Olympics, February 2022. From @央视新闻 on Weibo)
