Background
AI transcription (speech to text) is here to stay. It plays an important role in everyday life – whether transcribing voicemail, handling customer interactions or taking straightforward dictation. It is incredibly powerful, although it is still not particularly accurate, and there are huge concerns about privacy and data protection (some universities have banned their researchers from using AI to transcribe audio interviews).
Interestingly, a number of transcription companies that embraced AI and converted their services to it in the last few years have now shifted back towards the high accuracy levels human transcribers offer. There have been serious issues with the accuracy of the AI options, whether supplied by Teams, Zoom or external providers. Others have promoted their human transcription services and offer AI simply as a low-cost alternative. Our company is about to embark on a three-year research and development project with the University of Salford to develop a solution suitable for universities and academic transcription (press release here), and it will be very interesting to see whether we can build a commercially viable model that is accepted by institutions.
We have been told consistently since we set up our specialist academic transcription services in the early 2000s that our services would be replaced by robots and AI transcription is just another technological approach that is yet to give us sleepless nights!
2023 Research – Whisper AI
AI transcription was recently the subject of an academic research paper entitled “Careless Whisper: Speech to Text Hallucination Harms” by Koenecke, Choi, Mei, Schellmann and Sloane – based at Cornell University, the University of Washington, the University of Virginia and New York University. They carried out a study of OpenAI’s Whisper, an open-source speech-to-text model available via API for commercial use by other organisations. It is the code behind a whole range of websites and platforms offering AI transcription services at extremely low cost.
Whisper was released by OpenAI in September 2022. At the time it was considered more accurate than its competitors.
The researchers were keen to examine whether disproportionate fictional detail was added to transcriptions where the speakers had a speech impairment – in this study they used people with aphasia and sourced the audio files from an aphasia project at Carnegie Mellon University. This data was compared with a control group of speakers without a speech impairment. Using Python, they ran the audio segments through Whisper between April and December 2023 – just over 13,000 audio segments in total.
The researchers discovered that over 1% of the audio transcriptions contained completely fictional phrases and sentences that did not match the spoken word on the audio file in any way. Whisper inserted fabricated physical violence or death, sexual innuendo, demographic stereotyping, invented names, invented relationships, invented health statuses, false claims of authority, thanks to the reader and even website links.
The researchers included a number of examples demonstrating this:
| Spoken Word | AI Version |
| --- | --- |
| “The next thing I really knew, there were three guys who take care of me abcde the special.” | “The next thing I really knew, there were three guys who take care of me. Mike was the PI, Coleman the PA, and the leader of the related units was my uncle. So I was able to command the inmates.” |
| “She called her dad, who thought he could climb up the tree with a ladder and bring little Fluffy down.” | “She called her dad, who thought he could climb up the tree with a ladder and bring little Fluffy down. The others sat next to her and fondled her.” |
| “And he, the boy was going to, I’m not sure exactly, take the umbrella.” | “And he, the boy was going to, I’m not sure exactly, take the umbrella. He took a big piece across. A teeny small piece. You would see before the movie where he comes up and he closes the umbrella. I’m sure he didn’t have a terror knife so he killed a number of people who he killed and many more other generations that were. And he walked away.” |
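A striking feature of these examples is that the AI version reproduces the spoken word and then appends entirely invented material. The researchers' own Python pipeline is not public, but as a purely illustrative sketch (the function names `normalize` and `fabricated_suffix` are our own, not the study's code), hallucinated additions of this appended kind can be flagged by comparing the AI output against a trusted human reference transcript:

```python
import re

def normalize(text):
    """Lowercase and strip punctuation so formatting differences are ignored."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()

def fabricated_suffix(reference, hypothesis):
    """Return any words the AI transcript appends beyond the reference.

    A crude check: if the hypothesis begins with the reference (after
    normalisation), everything that follows was invented by the model.
    """
    ref, hyp = normalize(reference), normalize(hypothesis)
    if hyp[:len(ref)] == ref and len(hyp) > len(ref):
        return " ".join(hyp[len(ref):])
    return ""

spoken = ("She called her dad, who thought he could climb up the tree "
          "with a ladder and bring little Fluffy down.")
ai_version = spoken + " The others sat next to her and fondled her."
print(fabricated_suffix(spoken, ai_version))
# → the others sat next to her and fondled her
```

This only catches the appended-sentence pattern shown in the table above; detecting fabrication woven into the middle of a transcript would need proper sequence alignment, and of course in production there is no human reference to compare against – which is precisely the problem.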
Just to add to the somewhat terrifying thought that the artificial intelligence was inventing additional material to embellish the transcriptions, the researchers found a tendency for the AI to invent ‘harmful’ content: 19% of the fabricated material perpetuated violence and 8% invoked false authorities.
Whilst Whisper appeared to improve on its propensity to invent transcription content, it was still ‘hallucinating’ when the researchers re-ran the audio in December 2023, just on a smaller scale.
Looking at the research, the hallucinations appear across the board rather than being specific to people with aphasia, even though aphasia was the researchers’ original focus.
The study ran the same comparison against other mainstream AI transcription models – Google Chirp, Google Speech-to-Text, Amazon, Rev and Assembly AI – and discovered no such issues. The researchers concluded that as OpenAI had trained Whisper using over 1,000,000 hours of YouTube videos, it was simply drawing on this material. Presumably the other models were not trained on similar data.
Concerns
Other than the somewhat disturbing thought that OpenAI have built a machine that overthinks and ignores instructions from humans (i.e. to accurately produce a written version of the spoken word), the issue of AI transcription models inventing content has implications for:
- Job applications using AI video interview software – there are systems already being used that reject applications and rank candidates using AI.
- Accessing resources using the spoken word – if an AI transcription model is capable of inventing content, anyone speaking into an automated telephony system is relying on the model not to embellish their response in a way that could prevent access.
It is also a very strong demonstration that AI is only as good as its sources. It does not yet seem able to distinguish between what it ought to be doing and what it has been trained to do. Much like a dog being asked to fetch a stick, AI has not yet developed sufficiently to decide where it should and should not fetch the stick from!
Human Transcription
It is possible that AI transcription models will eventually attempt to emulate humans who use the ‘fully edited transcript’ method to create crafted transcription. A full explanation of the different levels of transcription is available here. However, fully edited transcription relies on individual human judgement as to what is important, how sentences need to be constructed to make sense, and what detail can be left out. AI transcription seems light years away from having this ability.
For further information on Whisper please visit the OpenAI website and read their paper on the methodology behind Whisper here. We are very sceptical of the authors’ claim that human transcription is less than 1% more accurate than AI transcription (we regularly have to re-transcribe from scratch the AI transcripts sent to us), but as ever it will depend on which audio and video files were used for the comparison. As a general rule of thumb, single-speaker audio without background noise works very well via automated systems, but anything else requires specialist human academic transcribers.