Infrastructure
Building Real-Time Voice AI: Inside the Infrastructure That Achieves Sub-400ms Human-Like Conversations
TL;DR: Building human-like Voice AI isn’t just about better models; it’s about infrastructure. Simplismart achieves sub-400ms real-time voice latency by treating STT, LLMs, and TTS as a single coordinated system. Fast batched STT, KV-cached LLM inference with stable inter-token latency, and sentence-aware, chunked TTS streaming together enable natural, low-latency conversations at scale. Voice quality, tool calling, and persona control remain modular, making this production-grade Voice AI infrastructure.
Last Updated
January 29, 2026

Voice AI is often marketed as a model problem: better STT, more natural TTS, larger LLMs. In practice, this framing is incomplete.


Real-time voice interaction is a systems problem.


Human conversation does not operate sequentially. Listening, thinking, and speaking overlap continuously. Any artificial delay, whether in transcription, reasoning, or speech synthesis, immediately breaks immersion. Once latency crosses a narrow threshold (roughly 400ms), users no longer perceive the system as conversational. They perceive it as lag.


Most voice AI systems today fail not because their models are weak, but because their infrastructure is not designed for real-time constraints.


This post is a deep dive into how we approached this problem at Simplismart:

Not by tuning one model, but by engineering an end-to-end voice AI pipeline optimized for:

  • Sub-400ms TTFB (Time-To-First-Byte) at scale
  • High concurrency at low GPU cost
  • Stable inter-token latency
  • Human-like speech flow under real production load


Our Starting Point in Voice AI: Breaking Whisper


Three years ago, our journey into voice AI infrastructure started with a simple but ambitious goal:
make Whisper fast enough for real-time use.


At the time, Whisper was accurate but not designed for production-scale, low-latency voice AI agents. Most implementations treated it as a batch transcription model, unsuitable for live systems.


We decided to break that assumption.


In doing so, we learned something critical:


Optimizing a single model exposes bottlenecks everywhere else in the system.


Once Whisper became fast, the LLM became the bottleneck.
Once the LLM was optimized, TTS latency surfaced.
Once TTS improved, inter-token jitter degraded speech quality.


That experience shaped our core philosophy:


Great voice AI is built by aligning every stage of the pipeline to a single real-time goal, not just by picking better models.


Defining the Real Goal: What “Human-Like” Actually Means in Voice AI


Before discussing optimizations, it’s important to define the target precisely.


For a voice AI agent to feel human-like, four conditions must hold simultaneously:

  1. Overall latency (Time to First Response) < 400ms
    Anything beyond this feels artificial.
  2. High-Quality Transcription
    Errors propagate downstream. Poor STT degrades reasoning and speech quality.
  3. Stable Inter-Token Latency (ITL)
    Speech must flow naturally. Token jitter creates audible hesitation.
  4. Natural, Expressive Audio Output
    Human tone, pauses, and sentence-level coherence matter more than raw synthesis speed.


These are not independent. Improvements in one component often worsen another unless the system is designed holistically.


The Voice AI Pipeline: STT → LLM → TTS (As a Single System)


Most voice stacks treat these as three independent services. We treat them as one coordinated pipeline.


Why This Matters


In a real conversation:

  • Listening does not block thinking
  • Thinking does not block speaking
  • Speaking begins before thought is complete


To replicate this, infrastructure must support:

  • Parallelism
  • Streaming boundaries
  • Dynamic batching
  • Backpressure-aware scheduling


We optimized each stage not in isolation, but based on how it affects the next.
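
To make those requirements concrete, here is a minimal sketch (an illustration, not our production code) of the three stages wired together with bounded asyncio queues, so that listening, thinking, and speaking overlap and a slow stage applies backpressure to the one before it. `transcribe`, `generate_tokens`, and `synthesize` are hypothetical stand-ins for the real STT, LLM, and TTS calls.

```python
import asyncio

# Hypothetical stand-ins for the real STT, LLM, and TTS calls.
async def transcribe(audio: bytes) -> str: ...
async def generate_tokens(prompt: str): yield "token "
async def synthesize(text: str): yield b"\x00" * 320

async def stt_stage(audio_in: asyncio.Queue, text_out: asyncio.Queue) -> None:
    # Transcribe each utterance as soon as it arrives; don't wait for the call to end.
    while (audio := await audio_in.get()) is not None:
        await text_out.put(await transcribe(audio))
    await text_out.put(None)

async def llm_stage(text_in: asyncio.Queue, token_out: asyncio.Queue) -> None:
    # Stream tokens out as they are generated instead of returning a finished reply.
    while (prompt := await text_in.get()) is not None:
        async for token in generate_tokens(prompt):
            await token_out.put(token)
    await token_out.put(None)

async def tts_stage(token_in: asyncio.Queue, audio_out: asyncio.Queue) -> None:
    # Start synthesizing speech before the LLM has finished the full response.
    while (text := await token_in.get()) is not None:
        async for chunk in synthesize(text):
            await audio_out.put(chunk)
    await audio_out.put(None)

async def run_pipeline(audio_in: asyncio.Queue, audio_out: asyncio.Queue) -> None:
    # Bounded queues provide backpressure: a slow stage pauses its upstream producer
    # instead of letting work pile up and latency drift.
    text_q = asyncio.Queue(maxsize=8)
    token_q = asyncio.Queue(maxsize=8)
    await asyncio.gather(
        stt_stage(audio_in, text_q),
        llm_stage(text_q, token_q),
        tts_stage(token_q, audio_out),
    )
```
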


The rest of this post breaks down how.


Stage 1: Speech-to-Text (STT) - Optimized for Sync, Not Streaming


The Industry Assumption


Most voice AI systems attempt to stream STT token-by-token. This sounds intuitive, but in practice it increases complexity and often worsens latency.


Our Observation


In real-world voice AI agents:

  • User utterances typically arrive in 5–20 second chunks
  • Average = ~10 seconds
  • Users rarely speak continuously without pause

Our Approach


Instead of streaming Whisper token-by-token, we optimized it for fast synchronous transcription:

  • 10 seconds of audio transcribed in ~50ms
  • 25+ concurrent streams per GPU
  • Dynamic batching instead of sequential execution


With batching, what would have taken ~1 second sequentially for 20 audio streams (20 streams x 50ms per stream) collapses to ~90–150ms under load.


This gives us a predictable, low-latency transcription stage without unnecessary streaming overhead.
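
A rough sketch of the dynamic-batching idea (assumptions only; the production path batches at the serving layer, and `transcribe_batch` is a hypothetical method that accepts a list of audio buffers): requests arriving within a few milliseconds of each other are grouped into one GPU pass instead of being transcribed sequentially.

```python
import asyncio

MAX_BATCH_SIZE = 25     # concurrent streams per GPU, matching the numbers above
MAX_WAIT_S = 0.01       # wait up to 10 ms for more requests before flushing the batch

async def stt_batching_loop(requests: asyncio.Queue, model) -> None:
    # Each request is a dict: {"audio": <bytes>, "future": asyncio.Future for the transcript}.
    loop = asyncio.get_running_loop()
    while True:
        batch = [await requests.get()]          # block until at least one request exists
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH_SIZE and (remaining := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(requests.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        audios = [req["audio"] for req in batch]
        # One batched forward pass: ~90-150 ms for a full batch instead of ~50 ms x N sequentially.
        transcripts = await asyncio.to_thread(model.transcribe_batch, audios)   # hypothetical API
        for req, text in zip(batch, transcripts):
            req["future"].set_result(text)
```
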


Latency Contribution (STT)
  • Input audio window: ~10 seconds (user speech, unavoidable)
  • Transcription latency: ~50 ms (single stream)
  • Batched latency under load: ~90–150 ms (25+ concurrent streams)


Effective contribution to end-to-end latency:

~120 ms, depending on load

Voice AI Infra: Stage 1 latency


Stage 2: LLMs - Where Latency Is Won or Lost


Once transcription is fast, the LLM becomes the dominant latency contributor.


The Hidden Cost: Context


Most voice AI agents operate with:

  • 2,000–4,000 tokens of fixed system context (playbooks, rules, persona)
  • 50–60 tokens of user input per turn


In a straightforward implementation, this would mean recomputing thousands of tokens per request.


The Key Optimization: Prefix + KV Caching


We treat the system context as immutable:

  • Cached once
  • Never recomputed
  • Shared across calls


As a result:

  • Only ~100 new tokens are processed per turn
  • LLM TTFT drops to ~30–40ms
  • Even at 25 concurrent requests, latency remains stable (~140ms worst case on 4B models)
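
On the application side, the main requirement is that the fixed context stay byte-identical across calls so the serving engine can hit its prefix/KV cache. A minimal sketch of that prompt discipline (simplified chat formatting, placeholder playbook text) looks like this:

```python
# Loaded once at startup and kept byte-identical for the life of the deployment:
# ~2,000-4,000 tokens of playbooks, rules, and persona (placeholder text here).
SYSTEM_CONTEXT: str = "You are a support agent. <rules, playbooks, persona go here>"

def build_prompt(history: list[str], user_turn: str) -> str:
    # Everything that varies per call (history, timestamps, call IDs) comes AFTER the
    # fixed prefix. Injecting per-call data into the prefix would change it byte-for-byte
    # and force the engine to recompute thousands of tokens instead of only the ~100 new ones.
    turns = "\n".join(history)
    return f"{SYSTEM_CONTEXT}\n{turns}\nUser: {user_turn}\nAssistant:"
```
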

Inter-Token Latency Matters More Than Throughput


Voice quality depends on how evenly tokens are emitted, not just how fast the first one arrives.


We enforce concurrency caps per GPU to ensure:

  • Consistent ITL (2–5ms per token)
  • No mid-sentence stalling
  • Natural speech cadence


If concurrency exceeds thresholds, we scale horizontally instead of overloading a single GPU.
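
A minimal sketch of a per-replica concurrency cap (an illustration of the mechanism, not our scheduler; `llm_stream` and `OverCapacity` are hypothetical): requests beyond the cap are bounced back to the router so they land on another replica instead of degrading ITL on this one.

```python
import asyncio

MAX_CONCURRENCY = 25                     # per-GPU cap chosen to keep ITL in the 2-5 ms band
_slots = asyncio.Semaphore(MAX_CONCURRENCY)

class OverCapacity(Exception):
    """Raised so the router/autoscaler sends this request to another replica."""

async def llm_stream(prompt: str):       # hypothetical stand-in for the serving engine
    yield "token "

async def generate_capped(prompt: str):
    # Refuse new work instead of queueing it: queueing here would stretch inter-token
    # latency for every conversation already in flight on this GPU.
    if _slots.locked():
        raise OverCapacity()
    async with _slots:
        async for token in llm_stream(prompt):
            yield token
```
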


Here is our detailed approach on autoscaling.


Latency Contribution (LLM)


With prefix + KV caching and concurrency control:

  • LLM TTFT (single request): ~30–40 ms
  • LLM TTFT (25 concurrent requests): ~120–140 ms (worst case)
  • Inter-token latency: ~2–5 ms per token (stable)


Effective contribution to end-to-end latency:

~140 ms, depending on concurrency

Voice AI Infra: Stage 1 + Stage 2 latency


Stage 3: Text-to-Speech (TTS) - Making Audio Streamable


In real-time voice AI systems, the per-token latency matters as much in TTS as it does in the LLM itself. The faster a sentence (or a meaningful fragment of it) is generated, the sooner it can be handed off to TTS. Any delay here directly compounds end-to-end conversational latency.


Most open-source TTS models are:

  • Sentence-based
  • Non-streamable by default


For example, models like Orpheus TTS impose an additional constraint: they require a minimum token context before producing stable audio. In practice:

  • Orpheus waits for ~28 tokens before beginning synthesis.
  • This corresponds to roughly ~500 ms of voiced audio; fewer tokens lead to unstable or incorrect outputs (or word-by-word output, which sounds robotic).
  • Internally, Orpheus combines an LLM frontend with a SNAC decoder, converting token sequences into numerical wave representations (numpy-based audio tensors).


Rather than waiting for full sentences or full audio tensors:

  • Tokens are accumulated in 28-token batches.
  • Audio is decoded slice-wise (e.g., a few frames at a time).
  • Playback begins immediately once the first slice is ready, while decoding continues incrementally.


If token generation takes ~1 ms per token, effective TTS latency becomes tightly coupled to LLM token latency and to sentence length (shorter sentences have lower latency than longer ones).


This creates a hard latency floor unless the system is explicitly designed to overlap text generation and audio synthesis.
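
A sketch of the token-gating behaviour described above (our reading of the Orpheus constraint, with a hypothetical `decode_audio()` wrapping the SNAC decoder): synthesis cannot begin until the first 28 tokens have accumulated, which is exactly the latency floor the streaming design below works around.

```python
MIN_TOKENS = 28   # Orpheus needs roughly this much context (~500 ms of speech) for stable audio

def gated_synthesis(token_stream, decode_audio):
    # `decode_audio` is a hypothetical wrapper around the SNAC decoder: tokens in, PCM out.
    buffer = []
    for token in token_stream:
        buffer.append(token)
        if len(buffer) >= MIN_TOKENS:
            yield decode_audio(buffer)    # the first yield here is gated by the 28-token minimum
            buffer = []
    if buffer:
        yield decode_audio(buffer)        # trailing tokens at the end of a sentence
```
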


Sentence-Aware Streaming


To eliminate this bottleneck, we introduce a sentence-aware streaming layer:

  • Text is streamed token-by-token from the LLM.
  • Sentence boundaries (., ?, !) are detected in real time.
  • As soon as a partial sentence is semantically coherent, it is dispatched to the TTS engine without waiting for the full response to complete.


This allows TTS to begin work while the LLM is still generating downstream tokens.
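
A minimal sketch of such a sentence splitter (the regex and the minimum-length guard are assumptions; the real coherence check is more involved): it sits between the LLM token stream and the TTS engine and emits a chunk as soon as a boundary appears.

```python
import re

SENTENCE_END = re.compile(r"[.?!]\s*$")   # ., ?, !; closing quotes/brackets omitted for brevity
MIN_CHARS = 12                            # crude guard against fragments like "Dr." or "e.g."

async def sentence_chunks(token_stream):
    # token_stream is the async iterator of text tokens coming out of the LLM.
    buffer = ""
    async for token in token_stream:
        buffer += token
        if len(buffer) >= MIN_CHARS and SENTENCE_END.search(buffer):
            yield buffer.strip()    # dispatch this sentence to TTS now...
            buffer = ""             # ...while the LLM keeps generating the next one
    if buffer.strip():
        yield buffer.strip()        # flush whatever remains when generation ends
```
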


Chunked Audio Decoding


On the audio side, we avoid decoding full waveform tensors upfront. Instead:

  • Audio is decoded in small slices (for example, ~3 frames at a time).
  • Playback begins immediately after the first slice is ready.
  • Decoding continues incrementally in parallel with playback.


This approach reduces:

  • Audio Time-To-First-Byte (TTFB) from ~120–150ms down to ~60ms
  • Perceived latency, which matters more than raw benchmarks


The result is continuous, human-like speech, rather than delayed, bursty playback that users subconsciously associate with machines.
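
A sketch of the slice-wise decode-and-play loop (the frame size and the `decode_frames` / `enqueue_playback` callables are assumptions): the first call to the playback sink is the audio TTFB, and later slices are decoded while earlier ones are already playing.

```python
import numpy as np

FRAMES_PER_SLICE = 3     # decode a few frames at a time instead of the whole utterance

def stream_slices(frame_codes: np.ndarray, decode_frames, enqueue_playback) -> None:
    # decode_frames: hypothetical SNAC-style decoder (codes in, PCM samples out).
    # enqueue_playback: hypothetical non-blocking sink feeding the audio device buffer,
    # so decoding of slice N+1 overlaps with playback of slice N.
    for start in range(0, len(frame_codes), FRAMES_PER_SLICE):
        pcm = decode_frames(frame_codes[start:start + FRAMES_PER_SLICE])
        enqueue_playback(pcm)    # the first call here is the ~60 ms audio TTFB
```
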


Latency Contribution (TTS)


Key constraints and mitigations:

  • Token generation time: ~1 ms/token → ~28 ms gating delay
  • Audio TTFB (after chunked decoding): ~60 ms


Because sentence-aware streaming overlaps LLM generation with TTS synthesis:

  • TTS no longer waits for full sentences
  • Audio decoding begins as soon as the first viable slice is ready


Effective contribution to end-to-end latency:

~100-120 ms, largely overlapped with LLM generation

Voice AI Infra: End-to-end latency
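
Putting the three stage budgets together, a back-of-envelope sum (the same numbers quoted above, no new measurements) shows how the pipeline stays under the ~400 ms threshold even before accounting for overlap:

```python
stt_ms = 120   # Stage 1: batched transcription under load
llm_ms = 140   # Stage 2: worst-case TTFT at 25 concurrent requests, prefix/KV cached
tts_ms = 110   # Stage 3: ~100-120 ms, largely overlapped with LLM generation

serial_worst_case = stt_ms + llm_ms + tts_ms    # 370 ms even with no overlap at all
# Sentence-aware streaming runs much of the TTS budget concurrently with LLM generation,
# so perceived end-to-end latency lands comfortably below the ~400 ms threshold.
```
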


Why We Avoid End-to-End Speech-to-Speech Black Boxes


Some voice AI systems bypass STT and TTS entirely by using end-to-end speech-to-speech models. While this approach is attractive on paper, with fewer components and a simpler pipeline, it introduces fundamental trade-offs that limit real-world usability.


The most important limitation is loss of controllability.


In end-to-end systems, quality, tone, reasoning, and behavior are all tightly coupled to a single model. If the output quality is subpar, the only option is to swap the entire system. If you need better reasoning, different speaking styles, or function calling, you are constrained by whatever capabilities that one model exposes.


This breaks down quickly for real voice AI agents.


Consider a human-like voice AI agent handling a banking or support workflow:

  • A user asks why their account balance changed.
  • A human agent would look up records, query systems, and respond with context.
  • An LLM can do the same, but only if it can call tools, query APIs, and reason over structured outputs.


End-to-end speech-to-speech models make this extremely difficult:

  • Tool calling becomes opaque or impossible.
  • You cannot selectively choose specialized function-calling models (e.g. GPT-OSS, Function Gemma).
  • Domain logic, retrieval, and reasoning are entangled with speech generation.
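
To make the tool-calling point concrete, here is the kind of function schema (a hypothetical balance-history tool, written in the common OpenAI-style JSON format) that a text LLM inside a modular pipeline can call, and that has no clean equivalent inside a speech-to-speech black box:

```python
balance_tool = {
    "type": "function",
    "function": {
        "name": "get_balance_history",           # hypothetical banking backend call
        "description": "Return recent balance changes for the caller's account.",
        "parameters": {
            "type": "object",
            "properties": {
                "account_id": {"type": "string"},
                "days": {"type": "integer", "description": "Lookback window in days"},
            },
            "required": ["account_id"],
        },
    },
}
# In the modular pipeline, the LLM emits a structured call to this tool, the backend
# executes it, and the result is reasoned over and then spoken through TTS. With an
# end-to-end speech model there is no clean point at which to intercept, execute,
# and re-inject that structured output.
```
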


Equally important is voice and persona control.


In production systems, voice directly shapes user trust, engagement, and perceived intelligence; it is part of the core product experience.

  • Some use cases require neutral, compliant tones.
  • Others require expressive or human-like personalities.
  • Companion or assistant experiences often demand specific vocal styles (for example WizardLM Uncensored).


Black-box speech models force you into a single voice, a single tone, and a single behavior profile.


The Bigger Picture: Voice AI as Infrastructure, Not a Demo


What we’ve described is not a single optimization or a clever trick; it is a systems philosophy.


At scale:

  • Latency is a product feature, not just a benchmark.
  • Concurrency is an architectural constraint, not a load test detail.
  • Quality emerges from coordination, not from any single model in isolation.


Real-time voice AI only works when STT, LLMs, and TTS are treated as a tightly coordinated pipeline where generation, decoding, and playback overlap by design, and every millisecond is accounted for.


You can build this yourself. But doing so requires deep, sustained systems work across:

  • Kernel-level performance tuning
  • Dynamic batching and scheduling
  • Streaming-first model execution
  • End-to-end latency budgeting


This is the work we’ve been doing for years.


The result is not a demo that works on stage, but production-grade voice AI infrastructure designed to scale, adapt, and feel human under real-world load.


Contact us to make your voice agents sound more human.

Find out what tailor-made inference looks like for you.