OpenAI has launched the Whisper API, a hosted version of its open-source Whisper speech-to-text model, which was released in September 2022. Priced at $0.006 per minute, Whisper provides automatic speech recognition in multiple languages and translation from those languages into English.
OpenAI claims the system allows for “robust” transcription across various languages, and handles unique accents, background noise and technical jargon. It accepts several file formats, including M4A, MP3, MP4, MPEG, MPGA, WAV and WEBM. The system was trained on 680,000 hours of multilingual and “multitask” data gathered from the web, a data set OpenAI says has improved its recognition of accents and technical jargon. The Whisper API is optimised for convenience and speed.
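In practice, using the hosted API amounts to uploading an audio file in one of the supported formats and reading back the transcript. The sketch below assumes the official `openai` Python client and an `OPENAI_API_KEY` set in the environment; the file path is a placeholder, not from the article.

```python
# Minimal sketch of calling the hosted Whisper API, assuming the official
# `openai` Python client. The "whisper-1" model name is the hosted endpoint.
from pathlib import Path

# File formats the API accepts, per OpenAI's announcement.
SUPPORTED_FORMATS = {".m4a", ".mp3", ".mp4", ".mpeg", ".mpga", ".wav", ".webm"}


def is_supported(path: str) -> bool:
    """Check a file's extension against the formats the API accepts."""
    return Path(path).suffix.lower() in SUPPORTED_FORMATS


def transcribe(path: str) -> str:
    """Upload an audio file and return the transcribed text."""
    if not is_supported(path):
        raise ValueError(f"Unsupported audio format: {path}")
    # Imported lazily so the format check above works without the client.
    from openai import OpenAI  # requires OPENAI_API_KEY in the environment

    client = OpenAI()
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return result.text
```

A translation into English works the same way via the client's `audio.translations` endpoint instead of `audio.transcriptions`.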
Enterprise adoption of voice transcription technology has been slowed by several barriers, including cost, accent- or dialect-related recognition issues, and accuracy. OpenAI acknowledges Whisper's limitations, warning that because the model works by predicting the next word in an audio recording, it can transcribe words that were never actually spoken. Additionally, the system does not perform equally well across languages, with a higher error rate for underrepresented ones.
OpenAI sees Whisper’s transcription capabilities as being useful for improving existing products and tools. The AI-powered language learning app Speak is already using the Whisper API to power a new in-app virtual speaking companion. If OpenAI is able to break into the speech-to-text market, it could be highly profitable for the Microsoft-backed company. The segment is projected to be worth $5.4bn by 2026, up from $2.2bn in 2021.
OpenAI’s goal is to become a “universal intelligence” that can take in any kind of data and perform any task. While the company acknowledges the issue of bias in speech recognition systems, it is optimistic about Whisper’s potential and believes it will be a force multiplier for attention.