Lost in AI transcription: Adult words creep into YouTube children’s videos
HOW DOES “beach” become “bitch”, “buster” turn into “bastard” or “combo” morph into “condom”?
It happens when Google Speech-To-Text and Amazon Transcribe, two popular automatic speech recognition (ASR) systems, erroneously generate such age-inappropriate subtitles for YouTube videos meant for children.
This is the key finding of a study titled ‘Beach to bitch: Inadvertent Unsafe Transcription of Kids Content on YouTube’ which covered 7,013 videos from 24 YouTube channels.
Ten per cent of these videos contained at least one “highly inappropriate taboo word” for children, says US-based Ashique KhudaBukhsh, an assistant professor in the software engineering department at the Rochester Institute of Technology.
KhudaBukhsh, Sumeet Kumar, an assistant professor at the Indian School of Business in Hyderabad, and Krithika Ramesh of Manipal University, who conducted the study, have termed the phenomenon “inappropriate content hallucination”.
“We were mind-boggled because we knew that these channels were watched by millions of children. We understand this is an important problem because it is telling us that the inappropriate content may not be present in the source but it can be introduced by a downstream AI (Artificial Intelligence) application. So on the broader philosophical level, people generally have checks and balances for the source, but now we have to be more vigilant about having checks and balances if an AI application modifies the source. It can inadvertently introduce inappropriate content,” KhudaBukhsh, who has a PhD in machine learning and is from Kalyani in West Bengal, told The Sunday Express.
Inappropriate content hallucination was found in channels with millions of views and subscribers, including Sesame Street, Ryan’s World, Barbie, Moonbug Kid and Fun Kids Planet, according to the study.
The closed captions on YouTube videos are generated by Google Speech-To-Text, while Amazon Transcribe is a top commercial ASR system. Creators can use Amazon Transcribe to generate subtitle files for their videos and import them into YouTube when uploading the file.
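For creators who take that route, requesting captions from Amazon Transcribe is a short scripted step. The sketch below, which assumes the boto3 SDK and uses placeholder bucket, file and job names, asks the service to produce an SRT subtitle file for an uploaded video; it is an illustration of the workflow, not code from the study.

    # Rough sketch: request SRT captions for a video stored in S3.
    # Bucket, file and job names are placeholders, not real resources.
    import boto3

    transcribe = boto3.client("transcribe", region_name="us-east-1")

    transcribe.start_transcription_job(
        TranscriptionJobName="kids-video-captions",            # placeholder job name
        Media={"MediaFileUri": "s3://example-bucket/video.mp4"},
        MediaFormat="mp4",
        LanguageCode="en-US",
        Subtitles={"Formats": ["srt"]},                         # WebVTT is also supported
        OutputBucketName="example-bucket",                      # where the subtitle file is written
    )

The resulting .srt file can then be attached to the video as captions when it is uploaded to YouTube.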
The study was accepted and presented at the 36th annual conference of the Association for the Advancement of Artificial Intelligence in Vancouver in February.
“These patterns tell us that whenever you have a machine language model trying to predict something, the predictions are influenced by what kind of data it is trained on. Most likely, they don’t have enough examples of kid speech or baby talk in the data they are trained on,” KhudaBukhsh said.
The study points out that English-language subtitles are mostly disabled on the YouTube Kids app, but the same videos can be watched with subtitles on the main YouTube site.
“It is unclear how often kids are only confined to the YouTube Kids app while watching videos and how frequently parents (or guardians) simply let them watch kids’ content from general YouTube. Our findings indicate a need for tighter integration between YouTube general and YouTube kids to be more vigilant about kids’ safety,” the study states.
When asked about the accuracy of its automatic captions, a YouTube spokesperson said in a statement: “YouTube Kids delivers enriching and entertaining content for kids and is our recommended experience for children under 13. Automatic captions are not available on YouTube Kids, however, our caption tools on our main YouTube site allow channels to reach a wide audience and improve accessibility for everyone on YouTube. We are continually working to improve automatic captions and reduce errors.”
Another example of a misinterpreted word in one of the popular videos goes like this: “You should also find porn.” The actual dialogue ended with “corn”.
KhudaBukhsh said these errors could be due to the data fed to ASR systems during training. “See, ‘I love porn’ is a more likely sentence than ‘I love corn’ when two adults have a conversation. One of the reasons some of these adult words are creeping into transcriptions is because maybe the ASR systems are trained more on speech examples coming from adults,” he said.
KhudaBukhsh said introducing a human element into the transcription process could be one of the ways to stop these inappropriate words from being telecast to millions of young viewers. “We can have a human in the loop to check on transcription errors. We can have someone watch and manually confirm if it is there in the video or not,” he said.
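One way to picture such a check, as a rough sketch rather than anything the researchers prescribe, is a script that scans machine-generated captions against a list of taboo words and queues any matches for a human reviewer; the word list and names below are illustrative assumptions.

    # Rough sketch of a human-in-the-loop filter: flag caption lines containing
    # words from a taboo list so a reviewer can check them against the audio.
    TABOO_WORDS = {"bitch", "bastard", "condom", "porn"}  # illustrative list only

    def flag_for_review(caption_lines):
        """Return (line_number, line) pairs whose words include a taboo term."""
        flagged = []
        for i, line in enumerate(caption_lines, start=1):
            words = {w.strip(".,!?").lower() for w in line.split()}
            if words & TABOO_WORDS:
                flagged.append((i, line))
        return flagged

    captions = ["You should also find corn.", "Let's go to the beach!"]
    for line_no, text in flag_for_review(captions):
        print(f"Review caption {line_no}: {text}")

Only the flagged lines would reach a reviewer, who confirms whether the word was actually spoken before the captions go live.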
This is not the first time KhudaBukhsh has flagged the fallibility of AI systems. Last year, he and a student conducted a six-week experiment which showed that words like ‘black’, ‘white’ and ‘attack’ — common to those commenting on chess — could possibly fool an AI system into flagging certain chess conversations as racist. This was shortly after Agadmator, a popular YouTube chess channel with over a million subscribers, got blocked for not adhering to ‘Community Guidelines’ during a chess telecast.
KhudaBukhsh, who conducted this research at Pittsburgh’s Carnegie Mellon University, had said the findings were an eye-opener about the possible pitfalls of social media companies depending solely on AI to identify and shut down sources of hate speech.