Streaming TTS Benchmark Reveals Sub-200ms Latency Leaders

A new streaming text-to-speech benchmark compares AsyncFlow, ElevenLabs, and Cartesia on latency, audio quality, and real-time performance.

In real-time voice applications, milliseconds matter. A recent benchmark compared leading streaming text-to-speech (TTS) models to find the best balance between speed and audio fidelity.

The evaluation covered AsyncFlow from the Async Voice API, Eleven Flash v2.5 from ElevenLabs, and Sonic Turbo from Cartesia. Its findings are relevant to developers building voice assistants, live translation tools, and other interactive audio experiences.

Understanding Latency in Streaming TTS

Real-time voice products hinge on how quickly speech begins playing. For developers, this means every millisecond counts towards creating a more "alive" and responsive interaction. The benchmark measured two key latency metrics:

  • Model-level latency: The time spent solely on GPU inference, excluding network and application overhead.
  • End-to-end latency: The total time from request initiation to the client receiving the first audio byte (time to first byte, TTFB), including network and application overhead.
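
As a rough illustration of the end-to-end metric, the sketch below times a streaming response from just before the request is issued until the first non-empty audio chunk arrives. It is not the benchmark's harness: `measure_stream` and `fake_tts_stream` are hypothetical names, and the fake stream stands in for whichever vendor client is under test.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    open_stream: Callable[[], Iterable[bytes]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttfb_s, total_s, total_bytes) for one streaming request.

    The timer starts just before the stream is opened, so TTFB includes
    any network and application overhead the stream carries.
    """
    start = clock()
    ttfb = None
    total_bytes = 0
    for chunk in open_stream():
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if ttfb is None:
            ttfb = clock() - start  # first non-empty audio chunk arrived
        total_bytes += len(chunk)
    total = clock() - start
    if ttfb is None:
        raise ValueError("stream produced no audio")
    return ttfb, total, total_bytes


def fake_tts_stream(first_delay_s: float = 0.05, chunks: int = 3) -> Iterator[bytes]:
    """Hypothetical stand-in for a TTS client: waits, then yields audio bytes."""
    time.sleep(first_delay_s)
    for _ in range(chunks):
        yield b"\x00" * 320


if __name__ == "__main__":
    ttfb_s, total_s, size = measure_stream(lambda: fake_tts_stream())
    print(f"TTFB {ttfb_s * 1000:.1f} ms, total {total_s * 1000:.1f} ms, {size} bytes")
```

Model-level latency cannot be observed this way from the client side; it has to be reported by the serving stack itself, which is why the two metrics are listed separately.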

Performance Benchmarks: Speed and Quality

All models were tested under identical conditions, with multiple runs per model to account for caching bias. This allowed a clear comparison of raw model speed and real-world streaming performance.

Model Inference Latency:

This metric isolates the raw computational speed of the TTS model on GPU hardware. AsyncFlow demonstrated exceptional efficiency, achieving near floor-level inference times of approximately 20 ms on L4 GPUs.

ElevenLabs and Cartesia do not disclose their GPU hardware, so a like-for-like hardware comparison is not possible. The benchmark suggests they might rely on higher-tier hardware, which would make AsyncFlow's efficiency-to-cost ratio on L4 GPUs particularly noteworthy.

Streaming Latency Benchmark (End-to-End):

This metric captures perceived responsiveness in an application. AsyncFlow delivered audio earliest, with a median TTFB under 200 ms, making it well suited to interactive applications. Eleven Flash v2.5 showed slightly higher TTFB but finished generating the full audio sooner, while Cartesia Sonic Turbo lagged significantly behind in both TTFB and throughput.
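
A median over repeated runs, with early cache-affected runs discarded, could be computed along the lines of the sketch below. The helper name `summarize_ttfb`, the choice of one warm-up run, and the nearest-rank 95th percentile are assumptions for illustration. The sample values are invented and are not benchmark results.

```python
import math
import statistics
from typing import Dict, Sequence


def summarize_ttfb(samples_ms: Sequence[float], warmup: int = 1) -> Dict[str, float]:
    """Summarize TTFB samples after dropping the first `warmup` runs.

    Dropping warm-up runs is one simple way to keep cold-start or cache
    effects from skewing the summary; other schemes are possible.
    """
    steady = sorted(samples_ms[warmup:])
    if not steady:
        raise ValueError("no samples left after discarding warm-up runs")
    # Nearest-rank percentile: the smallest value covering 95% of the runs.
    p95_index = math.ceil(0.95 * len(steady)) - 1
    return {
        "runs": len(steady),
        "median_ms": statistics.median(steady),
        "p95_ms": steady[p95_index],
    }


if __name__ == "__main__":
    # Invented numbers for illustration only.
    made_up_samples = [480.0, 190.0, 170.0, 210.0, 180.0]
    print(summarize_ttfb(made_up_samples))
```

The median is less sensitive than the mean to an occasional slow request, which makes it a reasonable headline figure for TTFB.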

Subjective Quality Evaluation:

Using a pairwise comparison method with over 20 participants, the models were assessed for naturalness and expressiveness. Eleven Flash v2.5 achieved the highest score, showcasing strong prosody control.

AsyncFlow followed closely, maintaining remarkable consistency and minimal artifacts, offering an excellent quality-to-latency ratio. Cartesia Sonic Turbo received lower preference due to synthetic artifacts and intonation drift.
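
The article does not say how the pairwise votes were turned into scores. One common and simple option is a per-model win rate, sketched below under that assumption. The function name `win_rates` and the placeholder model labels are hypothetical, and the votes are invented.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Each vote records the two models a listener compared and the one they preferred.
Vote = Tuple[str, str, str]


def win_rates(votes: Iterable[Vote]) -> Dict[str, float]:
    """Return, for each model, the fraction of its comparisons that it won."""
    wins: Counter = Counter()
    appearances: Counter = Counter()
    for left, right, preferred in votes:
        if preferred not in (left, right):
            raise ValueError(f"preferred model {preferred!r} was not in the pair")
        appearances[left] += 1
        appearances[right] += 1
        wins[preferred] += 1
    return {model: wins[model] / appearances[model] for model in appearances}


if __name__ == "__main__":
    # Invented votes with placeholder model names.
    sample_votes = [
        ("model_a", "model_b", "model_a"),
        ("model_a", "model_c", "model_a"),
        ("model_b", "model_c", "model_c"),
        ("model_a", "model_b", "model_b"),
    ]
    for model, rate in sorted(win_rates(sample_votes).items()):
        print(f"{model}: {rate:.2f}")
```

Win rate ignores how strong each opponent was. With more votes, a rating model such as Bradley-Terry would account for that, at the cost of extra complexity.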

Why Sub-200 ms Latency Matters

In human conversation, delays exceeding 250-300 ms are noticeable. For streaming TTS, achieving sub-200 ms TTFB is critical for fluid, immediate, and conversational speech. This enables natural turn-taking in voice assistants, low-latency dubbing for live streams, and instant audible feedback for transcriptions, enhancing user experience and accessibility.

AsyncFlow has demonstrated that low latency, solid quality, and cost efficiency can be achieved simultaneously in a streaming TTS engine. While ElevenLabs excels in audio naturalness, AsyncFlow's architectural efficiency and sub-200 ms TTFB position it as a leading choice for developers building real-time, interactive voice systems where perceived responsiveness is paramount.
