Fixing the faltering AI transcription ecosystem

Automated transcription software isn’t exactly new as an idea or a finished product. Most of the notable players in the market, like Otter, Trint and Temi, came into prominence over the last ten years, with Rev launching as far back as 2010. All services that fall into this bracket are powered by artificial intelligence (AI), particularly machine learning and natural language processing (NLP), and promise to do away with laborious and time-consuming manual transcription, creating fewer bottlenecks in business.

The need for accurate automated transcription in both the business-to-business (B2B) and business-to-consumer (B2C) worlds – as well as occupations like journalism – has been ever present, but this need isn’t being served as well as it could be. Rather than any one provider being to blame, the ecosystem as a whole is underserving customers for various reasons – from inadequacies in the technology to a lack of accessibility.

Picking apart the faults in AI transcription

Much of transcription software’s marketing is directed at businesses, with features that focus on picking out key moments from recordings and enabling collaboration. There’s another area, however, where these tools have chosen to focus their energy: accessibility.

Prior to these tools being available, transcription was often prohibitively expensive. Since Rev burst onto the scene in 2010, primarily focused on being a marketplace for freelancers to provide less expensive transcription than was otherwise available, the market has seen a large influx of venture capital cash. With newcomers like Airgram and Verbit garnering huge cash injections in initial funding rounds – $10 million and $23 million respectively in 2022 – it’s important the ecosystem begins to listen to the needs of its customers. Initial promises of lower costs and greater access seem like a reason to jump for joy – but there’s a catch.

Transcription software is notorious for generating errors, no matter how finely tuned the AI is. One very high-profile example is YouTube, which has come under fire for removing a community captioning feature that allowed creators to lessen the burden of fixing the often clunky transcriptions provided by parent company Google.

Social media abounds with examples of poor transcriptions, all powered by various forms of AI, from major tech names like Facebook and TikTok. Many of these tools argue the solution is usage; the more diverse the user base, the more accurate the transcriptions are – and vice versa. There are also significant differences between companies: some offer multi-language support while others are far more geared towards English. What’s clear – even after just a cursory look at this segment of the market – is that it isn’t as open and shut as marketing would have you believe.

AI transcription’s accessibility problem


Improper captions are one reason why Svetlana Kouznetsova, a New York-based B2B accessibility strategy consultant who previously worked in web development, believes AI transcriptions are only fit for purpose when context is understood by all parties. She says people should be wary of deploying such tools in business environments and presuming all will be fine following implementation.

“I do not use auto-captions for work,” Kouznetsova, who has impaired hearing, says. “I use auto-captions mostly for informal conversations because auto-captions are often inaccurate, and it's more than just words. There are many small things to consider that machines cannot do, also it's important that captions are readable and understandable. Even if captions are accurate but not formatted or designed well they are hard to read.”

Kouznetsova says that as tools have rushed to herald their accessibility features, they have seemingly forgotten that context is key when trying to address accessibility in these spaces. As an example, she cites Otter announcing an integration with Zoom in 2020 to provide live captioning. For her, it’s not just about which words can be spit onto the page, and in which order, but also about how transcriptions feed (or don’t feed) into readability.

“Many people don’t realise that those who can hear can fall back onto their good hearing, if text doesn't make sense to them. However, for us deaf people it's important that text is of good quality, otherwise it causes cognitive dissonance. I'm not against auto-transcribing tools. They may be useful for some situations. But they aren’t the best accessibility solution. It's like hiring a writer to write a book. The writer may do a great job writing a book. But if a book isn’t well designed, it's hard to read.”

Boosting the AI transcription ecosystem

Accurate transcription software might have played a hand in reducing the gap between what the IT team thinks customers want and what they’re actually asking for, quips Michelle Symonds, who worked for the CITI Group in various roles, including IT project manager, for more than a decade, before founding her own UK-based SEO company.


“I think one really big issue then, and now, is that stakeholders and end-users make assumptions,” she explains. “IT professionals make assumptions, too, and they're not always the same assumptions. If you could have some record of that it might help to solve that sort of problem. It’s quite simply that IT people and business people, for want of a better word, don't speak the same language; that the IT people are thinking in a different way.”

Although there are problems, the quality of the technology has been improving over time, even though much of the software remains anglo-centric. Using the technology also still requires a significant amount of human intervention to sharpen up transcriptions generated from conversations. There might be hope on the horizon, though, in the form of more advanced technology branded Whisper, developed by OpenAI, which claims to recognise and translate audio at near-human levels. Should the likes of Whisper, or other ventures, live up to their potential, they could raise standards across the entire ecosystem.

John Loeppky is a British-Canadian disabled freelance writer based in Regina, Saskatchewan. His work has appeared at the CBC, FiveThirtyEight, Defector, and a multitude of others. John most often writes about disability, sport, media, technology, and art. His goal in life is to have an entertaining obituary to read.