Cohere launches open-source Transcribe model to surpass OpenAI Whisper in speech recognition accuracy

The open-source model outperforms OpenAI’s Whisper, providing developers with a more accurate, locally deployable solution for privacy-conscious speech intelligence.

March 27, 2026

Cohere has officially entered the competitive automatic speech recognition market with the release of its first dedicated voice model, a move that signals a significant expansion of the company’s enterprise AI portfolio.[1][2] Known primarily for its large language models and advanced text embeddings, the company has now unveiled an open-source speech-to-text system designed to compete directly with industry standards such as OpenAI’s Whisper. According to the latest performance metrics, the new model, named Cohere Transcribe, sets a new state-of-the-art benchmark result, positioning it as a highly accurate and efficient alternative for developers and organizations seeking high-fidelity audio processing.[1] The model arrives at a time when demand for reliable, locally deployable transcription tools is surging, particularly in sectors where data privacy and processing speed are paramount.
The release is anchored by impressive performance data from the Hugging Face Open ASR Leaderboard, a standardized evaluation framework that measures word error rate across diverse real-world datasets.[3][4][5][6] Cohere Transcribe currently holds the top position on this leaderboard, recording an average word error rate of 5.42 percent.[1][3][2][5][6][7][4] This figure represents a notable lead over OpenAI’s Whisper Large v3, which stands at 7.44 percent, as well as other prominent models such as ElevenLabs Scribe v2 and Qwen3-ASR.[6][5] This difference in word error rate translates to a roughly 27 percent relative improvement in accuracy over the previously dominant Whisper model.[3] The evaluation encompasses a variety of challenging acoustic environments, including multi-speaker boardroom meetings and diverse international accents, suggesting that the model is robust enough for complex professional use cases where clarity and precision are non-negotiable.
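The 27 percent figure follows directly from the two reported error rates; a quick check of the arithmetic:

```python
whisper_wer = 7.44      # Whisper Large v3 average WER on the leaderboard (percent)
transcribe_wer = 5.42   # Cohere Transcribe average WER (percent)

# Relative improvement: how much of Whisper's error rate was eliminated.
relative_improvement = (whisper_wer - transcribe_wer) / whisper_wer
print(f"{relative_improvement:.1%}")  # → 27.2%
```

Note that this is a *relative* reduction in errors, not a 27-point drop in absolute word error rate.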
At the core of the model’s success is a hybrid architecture that balances acoustic precision with linguistic understanding.[5] Unlike many speech recognition systems that rely on pure Transformer architectures, Cohere Transcribe uses a Conformer-based encoder-decoder design.[5] This approach combines the strengths of convolutional neural networks, which excel at capturing local acoustic features such as individual phonemes, with the global contextual capabilities of Transformers.[5] By interleaving these layers, the model can more accurately interpret rapid transitions in sound while maintaining a coherent grasp of the overall sentence structure.[5] The 2-billion-parameter model was trained from scratch with a specific focus on production-grade reliability, rather than being a fine-tune of an existing open-weight framework.[3]
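As an illustration only, and not Cohere's actual implementation, the interleaving idea behind a Conformer-style block can be sketched in NumPy: self-attention mixes information across all frames (global context), a depthwise convolution mixes each frame only with its immediate neighbours (local acoustics), and the two are stacked with residual connections:

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x):
    """Global context: every frame attends to every other frame."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def depthwise_conv(x, kernel_size=3):
    """Local features: each frame mixes only with its near neighbours."""
    pad = kernel_size // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    kernel = rng.standard_normal((kernel_size, x.shape[-1])) * 0.1
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        out[t] = (padded[t:t + kernel_size] * kernel).sum(axis=0)
    return out

def conformer_block(x):
    """Interleave global attention and local convolution, residual-style."""
    x = x + self_attention(x)
    x = x + depthwise_conv(x)
    return x

frames = rng.standard_normal((50, 16))  # 50 audio frames, 16-dim features
out = conformer_block(frames)
print(out.shape)  # (50, 16)
```

A real Conformer block also includes feed-forward modules, layer normalization, and learned projections; the sketch above keeps only the convolution/attention interleaving that the design is named for.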
One of the most significant implications of this release is the focus on accessibility and local deployment. At 2 billion parameters, the model is compact enough to run on consumer-grade GPUs, making it an attractive option for researchers and small-to-medium enterprises that wish to avoid the latency and costs associated with proprietary cloud-based APIs. Because the model is released under an Apache 2.0 license, users can self-host the technology, ensuring that sensitive audio data never has to leave their local infrastructure. This is particularly relevant for industries such as legal services, healthcare, and finance, where strict compliance and data sovereignty regulations often limit the use of third-party cloud services. The ability to achieve world-class accuracy on modest hardware democratizes access to high-end speech intelligence, shifting the power dynamic away from a handful of large-scale cloud providers.
To validate the model's real-world utility beyond automated benchmarks, the developers conducted extensive human evaluations.[3] These pairwise comparisons involved trained reviewers assessing transcripts based on accuracy, coherence, and the correct transcription of proper nouns. In English-language tests, human evaluators preferred Cohere Transcribe over Whisper Large v3 in 64 percent of comparisons.[3][6][5] The model also showed strong performance in Japanese, achieving a 66 percent win rate against Whisper.[6] These human-centric metrics suggest that the model is particularly effective at avoiding the "hallucinations" and formatting errors that frequently plague other speech-to-text systems. Furthermore, the model includes a native 35-second chunking logic, allowing it to process long-form audio, such as hour-long earnings calls or lectures, without the performance degradation typically seen in models that lack stable long-range memory management.[4]
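The sources do not describe how the 35-second chunking is implemented; as a naive sketch under that assumption, long-form audio could be split into fixed windows like this (a production pipeline would likely add overlap between windows and logic to stitch transcripts at the boundaries):

```python
CHUNK_SECONDS = 35  # native chunk length reported for the model

def chunk_audio(duration_seconds, chunk_seconds=CHUNK_SECONDS):
    """Split a recording into consecutive (start, end) windows in seconds."""
    chunks = []
    start = 0.0
    while start < duration_seconds:
        end = min(start + chunk_seconds, duration_seconds)
        chunks.append((start, end))
        start = end
    return chunks

# A one-hour earnings call becomes 103 windows of at most 35 seconds.
windows = chunk_audio(3600)
print(len(windows), windows[0], windows[-1])
```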
While some competitors have pursued a strategy of supporting hundreds of languages with varying degrees of success, Cohere has opted for a quality-over-quantity approach.[5] The model currently supports 14 major languages, including English, Japanese, Arabic, Mandarin Chinese, and several European languages.[5] This targeted focus ensures that the supported languages receive the highest possible level of attention, resulting in state-of-the-art performance for those linguistic groups rather than a diluted performance across a broader set. However, it is important to note that the current iteration does not include automatic language detection, requiring users to pre-select their language, and it lacks built-in features for speaker separation, also known as diarization.[6] Despite these omissions, the company positions the model as a foundational layer for broader speech intelligence platforms, with plans to integrate it more deeply into its existing agent orchestration systems.[6]
The strategic timing of this release highlights the shifting landscape of the AI industry, where open-source models are increasingly reaching or exceeding the performance of their closed-source counterparts. For years, OpenAI's Whisper has been the default choice for developers building transcription into their products, but the emergence of a more accurate, more efficient, and fully open alternative creates new opportunities for innovation. By offering the model for free as a download on Hugging Face while also providing managed access via its own API, the company is bridging the gap between open-weight research and enterprise-grade reliability. This dual-track approach allows the community to stress-test and improve the model while providing a stable, scalable environment for corporate users.
In the broader context of the AI market, the move marks a pivot for Cohere toward becoming a comprehensive provider of enterprise intelligence. By expanding into audio, the company is addressing the bottleneck that often exists between unstructured voice data and actionable text. The ability to accurately convert speech into text is the critical first step for a wide range of downstream applications, including automated meeting summaries, real-time customer support analytics, and searchable databases of recorded content.[6] As businesses increasingly look to harness the "dark data" contained within their audio recordings, the availability of a high-accuracy, open-source tool like Transcribe could accelerate the development of specialized AI assistants and analytical tools.
Ultimately, the release of this model underscores a growing trend toward specialized, high-performance AI components. Rather than relying on a single monolithic model to handle all tasks, developers are increasingly assembling pipelines of best-in-class tools for specific functions like transcription, translation, and reasoning. The fact that an open-source model has managed to top the benchmarks against well-funded proprietary rivals suggests that the frontier of speech recognition is still expanding rapidly. For the AI industry, the primary takeaway is that transparency and performance can coexist, providing a powerful new toolset for the next generation of voice-enabled applications. As the model continues to be adopted and integrated into various software ecosystems, it is likely to become a new standard for how organizations process and understand the spoken word.

Sources