ElevenLabs and Google dominate AI transcription as word error rates fall to unprecedented levels
Specialized engineering and multimodal AI converge as ElevenLabs and Google set new records for automated transcription accuracy.
March 1, 2026

The landscape of automated speech recognition has reached a significant milestone, with ElevenLabs and Google emerging as the front-runners in the latest industry evaluations. In the newly released version of the speech-to-text benchmark from Artificial Analysis, an independent authority on AI performance metrics, a clear hierarchy has formed, placing specialized audio startups and diversified tech giants at the top of the field. The results indicate that the gap between human-level transcription and machine capabilities is closing faster than anticipated, with ElevenLabs and Google currently locked in a high-stakes race for technical supremacy.
The centerpiece of this update is the Artificial Analysis Word Error Rate index, which serves as a rigorous, real-world measurement of transcription accuracy.[1] In the latest iteration of the benchmark, ElevenLabs’ newest model, Scribe v2, secured the top position with an unprecedented word error rate of just 2.3 percent. This performance marks a significant improvement over previous industry standards and narrowly edges out Google’s premier offering, Gemini 3 Pro, which followed closely with an error rate of 2.9 percent. The competition between the two remains exceptionally tight, as Google’s Gemini 3 Flash and ElevenLabs’ older Scribe v1 model also posted scores in the low 3 percent range, effectively creating a dominant tier of performance that few other providers can match.
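For readers unfamiliar with the metric, word error rate counts the substitutions, deletions, and insertions needed to turn a model’s output into the reference transcript, divided by the number of reference words. Benchmarks differ in how they normalize text before scoring, so the following is only a minimal sketch of the standard edit-distance calculation, with illustrative sample strings rather than benchmark data:

```python
# Minimal sketch of word error rate (WER): the Levenshtein (edit) distance
# between reference and hypothesis word sequences, divided by the number of
# reference words. Sample strings are illustrative, not benchmark data.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions to match an empty hypothesis
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions to match an empty reference
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,       # deletion
                           dp[i][j - 1] + 1,       # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("turn the lights off in the kitchen",
                      "turn the light off in kitchen"))
# ≈ 0.286 (2 errors over 7 reference words)
```

By this measure, a 2.3 percent rate works out to roughly one error for every 43 words of reference speech.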
One of the most notable findings in the updated benchmark is the emergence of Google’s Gemini models as elite transcription engines. Unlike many of its competitors, Google did not train these models specifically for speech recognition; their high scores are a byproduct of general multimodal capability. This suggests that large-scale foundation models, trained on vast and diverse datasets that include video and audio, are becoming naturally proficient at understanding spoken language without specialized fine-tuning. For developers and enterprise clients, this represents a major shift in the market: a single versatile model can now deliver both high-end reasoning and industry-leading transcription accuracy.
ElevenLabs, conversely, has reached its leading position through a focused engineering approach tailored to the nuances of audio. Known primarily for its synthetic voice generation, the company’s pivot into speech-to-text with the Scribe series has been remarkably effective. Scribe is designed to handle the messy reality of human speech, including background noise, overlapping conversations, and diverse accents. The benchmark data reveals that ElevenLabs’ specialized focus provides a distinct advantage in specific sub-tests, such as the AA-AgentTalk evaluation. This test focuses on speech directed at voice assistants, and in this category ElevenLabs’ Scribe v2 led with a 1.6 percent error rate, with Google’s Gemini 3 Pro close behind at 1.7 percent.
The methodology behind these benchmarks reflects a growing demand for performance that translates to real-world usage rather than laboratory conditions. Artificial Analysis calculates its Word Error Rate index by weighting scores across three diverse datasets: AA-AgentTalk, which accounts for half the score; VoxPopuli-Cleaned-AA, representing 25 percent; and Earnings22-Cleaned-AA, making up the final 25 percent.[1][2] These datasets include everything from short commands to long-form financial meetings, ensuring that the top-performing models are capable of maintaining accuracy across a wide spectrum of audio lengths and technical vocabularies. The success of ElevenLabs and Google in these varied environments underscores their reliability for a broad range of enterprise applications, from automated customer support to complex document archival.
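The composite index itself is straightforward arithmetic: a weighted average of per-dataset word error rates. A minimal sketch follows, assuming hypothetical per-dataset scores, since this article quotes only the weights and the overall index values:

```python
# Sketch of the weighting scheme described above: AA-AgentTalk counts for
# 50 percent of the index, the other two datasets for 25 percent each.
# The per-dataset WER values below are placeholders, not published scores.

DATASET_WEIGHTS = {
    "AA-AgentTalk": 0.50,
    "VoxPopuli-Cleaned-AA": 0.25,
    "Earnings22-Cleaned-AA": 0.25,
}

def composite_wer(per_dataset_wer: dict[str, float]) -> float:
    assert abs(sum(DATASET_WEIGHTS.values()) - 1.0) < 1e-9
    return sum(DATASET_WEIGHTS[name] * wer
               for name, wer in per_dataset_wer.items())

# Hypothetical per-dataset scores for an illustrative model:
print(composite_wer({
    "AA-AgentTalk": 0.016,
    "VoxPopuli-Cleaned-AA": 0.028,
    "Earnings22-Cleaned-AA": 0.035,
}))  # 0.50*0.016 + 0.25*0.028 + 0.25*0.035 = 0.02375
```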
While accuracy is the primary metric for many, the benchmark also highlights the ongoing trade-off between quality, speed, and cost. In the speed-factor analysis, which measures how many seconds of audio a model can transcribe per second of processing time, the market remains highly fragmented. While ElevenLabs and Google lead in pure accuracy, providers like Deepgram and AssemblyAI continue to compete on the "efficiency frontier," offering lower latency and competitive pricing that appeal to developers building real-time applications such as live captioning or simultaneous translation. According to the data, ElevenLabs’ Scribe v1 maintains a speed factor of approximately 33.4, while Google’s Chirp 3 model sits at 29.2.[1] These figures are critical for industries where near-instantaneous feedback is a requirement, such as financial trading or emergency dispatch services.
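The speed factor translates directly into wall-clock processing time. A short sketch using the two factors quoted above (the hour-long input is an arbitrary example):

```python
# Speed factor = seconds of audio transcribed per second of processing.
# At the reported factors, an hour of audio takes roughly 108 seconds
# with Scribe v1 and roughly 123 seconds with Chirp 3.

def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    return audio_seconds / processing_seconds

def processing_time(audio_seconds: float, factor: float) -> float:
    return audio_seconds / factor

for model, factor in [("Scribe v1", 33.4), ("Chirp 3", 29.2)]:
    print(f"{model}: 1 hour of audio in ~{processing_time(3600, factor):.0f} s")
```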
The benchmark results also signal a shifting role for open-source models in the professional ecosystem. OpenAI’s Whisper Large v3, which has long been the gold standard for accessible, high-quality transcription, now finds itself in the middle of the pack. With a word error rate of 4.2 percent in this latest update, it remains a robust and widely used tool, but it is no longer the performance leader. The gap between open-source baselines and the most advanced proprietary models is widening, as companies like ElevenLabs and Google pour massive resources into specialized architectures and curated training sets. However, Whisper continues to serve as an important benchmark for the industry, providing a reliable and cost-effective alternative for users who do not require the cutting-edge accuracy of the leading proprietary APIs.
The implications for the broader AI industry are profound, as the competition between ElevenLabs and Google is likely to accelerate the adoption of voice-first interfaces. As transcription error rates drop below the 3 percent threshold, the friction of interacting with machines via voice essentially disappears. This facilitates a new era of "agentic" AI, where voice assistants can not only transcribe speech but understand the intent and context of a conversation with near-human precision. The updated benchmark suggests that we are moving toward a future where "hallucinations" (a common failure mode in which a model inserts or substitutes words that were never actually spoken) are becoming increasingly rare in the top-performing systems.
Furthermore, the introduction of secondary features like speaker diarization and audio-event tagging has become the new battleground for these leaders. Both ElevenLabs and Google are now integrating features that go beyond simple text conversion. For instance, ElevenLabs’ Scribe can automatically tag non-verbal events such as laughter, applause, or background music with high accuracy, while Google’s ecosystem offers deep integration with its Cloud platform for enterprise security and scalability. These "audio intelligence" features are becoming essential for media companies and developers who need more than just a raw transcript to make their audio content searchable and interactive.
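To illustrate why such tags matter downstream, here is a hedged sketch of how a consumer might separate spoken text from tagged events for search indexing. The Segment schema is hypothetical and does not reflect any specific provider’s actual response format:

```python
# Hypothetical transcript schema: each segment is either spoken text or a
# tagged non-verbal event, and a consumer splits the two streams so the
# transcript stays clean while events remain searchable by timestamp.

from dataclasses import dataclass

@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds
    kind: str     # "speech" or "event"
    text: str     # transcript text, or an event label like "laughter"

segments = [
    Segment(0.0, 2.1, "speech", "Welcome back to the show."),
    Segment(2.1, 3.0, "event", "applause"),
    Segment(3.0, 5.4, "speech", "Today we're talking about benchmarks."),
    Segment(5.4, 5.9, "event", "laughter"),
]

transcript = " ".join(s.text for s in segments if s.kind == "speech")
events = [(s.start, s.text) for s in segments if s.kind == "event"]
print(transcript)
print(events)  # [(2.1, 'applause'), (5.4, 'laughter')]
```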
As the industry looks ahead, the continued dominance of ElevenLabs and Google suggests a consolidating market where the barrier to entry for high-accuracy speech recognition is rising. While newer entrants like Mistral, with its Voxtral model, and Alibaba, with Qwen3 ASR, are showing impressive gains, they still face an uphill battle against the massive infrastructure of Google and the focused innovation of ElevenLabs. The latest Artificial Analysis benchmark has clarified the stakes: for those seeking the absolute pinnacle of transcription accuracy, the choice currently rests between the specialized prowess of a rising audio giant and the multimodal breadth of a traditional tech leader. This rivalry will undoubtedly continue to push the boundaries of what is possible in human-computer interaction, making high-fidelity voice technology more accessible and reliable for users around the globe.