Infrastructure
Building Real-Time Voice AI: Inside the Infrastructure That Achieves Sub-400ms Human-Like Conversations
TL;DR: Building human-like Voice AI isn’t just about better models; it’s about infrastructure. Simplismart achieves sub-400ms real-time voice latency by treating STT, LLMs, and TTS as a single coordinated system. Fast batched STT, KV-cached LLM inference with stable inter-token latency, and sentence-aware, chunked TTS streaming together enable natural, low-latency conversations at scale. Voice quality, tool calling, and persona control remain modular, making this production-grade Voice AI infrastructure.
Last Updated
January 29, 2026

Voice AI is often marketed as a model problem: better STT, more natural TTS, larger LLMs. In practice, this framing is incomplete.


Real-time voice interaction is a systems problem.


Human conversation does not operate sequentially. Listening, thinking, and speaking overlap continuously. Any artificial delay, whether in transcription, reasoning, or speech synthesis, immediately breaks immersion. Once latency crosses a narrow threshold (roughly 400ms), users no longer perceive the system as conversational. They perceive it as lag.


Most voice AI systems today fail not because their models are weak, but because their infrastructure is not designed for real-time constraints.


This post is a deep dive into how we approached this problem at Simplismart:

Not by tuning one model, but by engineering an end-to-end voice AI pipeline optimized for:

  • Sub-400ms TTFB (Time-To-First-Byte) at scale
  • High concurrency at low GPU cost
  • Stable inter-token latency
  • Human-like speech flow under real production load


Our Starting Point in Voice AI: Breaking Whisper


Three years ago, our journey into voice AI infrastructure started with a simple but ambitious goal:
make Whisper fast enough for real-time use.


At the time, Whisper was accurate but not designed for production-scale, low-latency voice AI agents. Most implementations treated it as a batch transcription model, unsuitable for live systems.


We decided to break that assumption.


In doing so, we learned something critical:


Optimizing a single model exposes bottlenecks everywhere else in the system.


Once Whisper became fast, the LLM became the bottleneck.
Once the LLM was optimized, TTS latency surfaced.
Once TTS improved, inter-token jitter degraded speech quality.


That experience shaped our core philosophy:


Great voice AI is built by aligning every stage of the pipeline to a single real-time goal, not just by picking better models.


Defining the Real Goal: What “Human-Like” Actually Means in Voice AI


Before discussing optimizations, it’s important to define the target precisely.


For a voice AI agent to feel human-like, four conditions must hold simultaneously:

  1. Overall latency (Time to First Response) < 400ms
    Anything beyond this feels artificial.
  2. High-Quality Transcription
    Errors propagate downstream. Poor STT degrades reasoning and speech quality.
  3. Stable Inter-Token Latency (ITL)
    Speech must flow naturally. Token jitter creates audible hesitation.
  4. Natural, Expressive Audio Output
    Human tone, pauses, and sentence-level coherence matter more than raw synthesis speed.


These are not independent. Improvements in one component often worsen another unless the system is designed holistically.


The Voice AI Pipeline: STT → LLM → TTS (As a Single System)


Most voice stacks treat these as three independent services. We treat them as one coordinated pipeline.


Why This Matters


In a real conversation:

  • Listening does not block thinking
  • Thinking does not block speaking
  • Speaking begins before thought is complete


To replicate this, infrastructure must support:

  • Parallelism
  • Streaming boundaries
  • Dynamic batching
  • Backpressure-aware scheduling


We optimized each stage not in isolation, but based on how it affects the next.
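
To make those requirements concrete, here is a minimal sketch (an illustration, not our production code) of the three stages wired together with bounded asyncio queues, so that listening, thinking, and speaking overlap and a slow stage applies backpressure to the one before it. `transcribe`, `generate_tokens`, and `synthesize` are hypothetical stand-ins for the real STT, LLM, and TTS calls.

```python
import asyncio

# Hypothetical stand-ins for the real STT, LLM, and TTS calls.
async def transcribe(audio: bytes) -> str: ...
async def generate_tokens(prompt: str): yield "token "
async def synthesize(text: str): yield b"\x00" * 320

async def stt_stage(audio_in: asyncio.Queue, text_out: asyncio.Queue) -> None:
    # Transcribe each utterance as soon as it arrives; don't wait for the call to end.
    while (audio := await audio_in.get()) is not None:
        await text_out.put(await transcribe(audio))
    await text_out.put(None)

async def llm_stage(text_in: asyncio.Queue, token_out: asyncio.Queue) -> None:
    # Stream tokens out as they are generated instead of returning a finished reply.
    while (prompt := await text_in.get()) is not None:
        async for token in generate_tokens(prompt):
            await token_out.put(token)
    await token_out.put(None)

async def tts_stage(token_in: asyncio.Queue, audio_out: asyncio.Queue) -> None:
    # Start synthesizing speech before the LLM has finished the full response.
    while (text := await token_in.get()) is not None:
        async for chunk in synthesize(text):
            await audio_out.put(chunk)
    await audio_out.put(None)

async def run_pipeline(audio_in: asyncio.Queue, audio_out: asyncio.Queue) -> None:
    # Bounded queues provide backpressure: a slow stage pauses its upstream producer
    # instead of letting work pile up and latency drift.
    text_q = asyncio.Queue(maxsize=8)
    token_q = asyncio.Queue(maxsize=8)
    await asyncio.gather(
        stt_stage(audio_in, text_q),
        llm_stage(text_q, token_q),
        tts_stage(token_q, audio_out),
    )
```
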


The rest of this post breaks down how.


Stage 1: Speech-to-Text (STT) - Optimized for Sync, Not Streaming


The Industry Assumption


Most voice AI systems attempt to stream STT token-by-token. This sounds intuitive, but in practice it increases complexity and often worsens latency.


Our Observation


In real-world voice AI agents:

  • User utterances typically arrive in 5–20 second chunks
  • Average = ~10 seconds
  • Users rarely speak continuously without pause

Our Approach


Instead of streaming Whisper token-by-token, we optimized it for fast synchronous transcription:

  • 10 seconds of audio transcribed in ~50ms
  • 25+ concurrent streams per GPU
  • Dynamic batching instead of sequential execution


With batching, what would have taken ~1 second sequentially for 20 audio streams (20 streams x 50ms per stream) collapses to ~90–150ms under load.


This gives us a predictable, low-latency transcription stage without unnecessary streaming overhead.
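
A rough sketch of the dynamic-batching idea (assumptions only; the production path batches at the serving layer, and `transcribe_batch` is a hypothetical method that accepts a list of audio buffers): requests arriving within a few milliseconds of each other are grouped into one GPU pass instead of being transcribed sequentially.

```python
import asyncio

MAX_BATCH_SIZE = 25     # concurrent streams per GPU, matching the numbers above
MAX_WAIT_S = 0.01       # wait up to 10 ms for more requests before flushing the batch

async def stt_batching_loop(requests: asyncio.Queue, model) -> None:
    # Each request is a dict: {"audio": <bytes>, "future": asyncio.Future for the transcript}.
    loop = asyncio.get_running_loop()
    while True:
        batch = [await requests.get()]          # block until at least one request exists
        deadline = loop.time() + MAX_WAIT_S
        while len(batch) < MAX_BATCH_SIZE and (remaining := deadline - loop.time()) > 0:
            try:
                batch.append(await asyncio.wait_for(requests.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        audios = [req["audio"] for req in batch]
        # One batched forward pass: ~90-150 ms for a full batch instead of ~50 ms x N sequentially.
        transcripts = await asyncio.to_thread(model.transcribe_batch, audios)   # hypothetical API
        for req, text in zip(batch, transcripts):
            req["future"].set_result(text)
```
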


Latency Contribution (STT)
  • Input audio window: ~10 seconds (user speech, unavoidable)
  • Transcription latency: ~50 ms (single stream)
  • Batched latency under load: ~90–150 ms (25+ concurrent streams)


Effective contribution to end-to-end latency:

~120 ms, depending on load

Voice AI Infra: Stage 1 latency


Stage 2: LLMs - Where Latency Is Won or Lost


Once transcription is fast, the LLM becomes the dominant latency contributor.


The Hidden Cost: Context


Most voice AI agents operate with:

  • 2,000–4,000 tokens of fixed system context (playbooks, rules, persona)
  • 50–60 tokens of user input per turn


In a straightforward implementation, this would mean recomputing thousands of tokens per request.


The Key Optimization: Prefix + KV Caching


We treat the system context as immutable:

  • Cached once
  • Never recomputed
  • Shared across calls


As a result:

  • Only ~100 new tokens are processed per turn
  • LLM TTFT drops to ~30–40ms
  • Even at 25 concurrent requests, latency remains stable (~140ms worst case on 4B models)
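
On the application side, the main requirement is that the fixed context stay byte-identical across calls so the serving engine can hit its prefix/KV cache. A minimal sketch of that prompt discipline (simplified chat formatting, placeholder playbook text) looks like this:

```python
# Loaded once at startup and kept byte-identical for the life of the deployment:
# ~2,000-4,000 tokens of playbooks, rules, and persona (placeholder text here).
SYSTEM_CONTEXT: str = "You are a support agent. <rules, playbooks, persona go here>"

def build_prompt(history: list[str], user_turn: str) -> str:
    # Everything that varies per call (history, timestamps, call IDs) comes AFTER the
    # fixed prefix. Injecting per-call data into the prefix would change it byte-for-byte
    # and force the engine to recompute thousands of tokens instead of only the ~100 new ones.
    turns = "\n".join(history)
    return f"{SYSTEM_CONTEXT}\n{turns}\nUser: {user_turn}\nAssistant:"
```
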

Inter-Token Latency Matters More Than Throughput


Voice quality depends on how evenly tokens are emitted, not just how fast the first one arrives.


We enforce concurrency caps per GPU to ensure:

  • Consistent ITL (2–5ms per token)
  • No mid-sentence stalling
  • Natural speech cadence


If concurrency exceeds thresholds, we scale horizontally instead of overloading a single GPU.
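
A minimal sketch of a per-replica concurrency cap (an illustration of the mechanism, not our scheduler; `llm_stream` and `OverCapacity` are hypothetical): requests beyond the cap are bounced back to the router so they land on another replica instead of degrading ITL on this one.

```python
import asyncio

MAX_CONCURRENCY = 25                     # per-GPU cap chosen to keep ITL in the 2-5 ms band
_slots = asyncio.Semaphore(MAX_CONCURRENCY)

class OverCapacity(Exception):
    """Raised so the router/autoscaler sends this request to another replica."""

async def llm_stream(prompt: str):       # hypothetical stand-in for the serving engine
    yield "token "

async def generate_capped(prompt: str):
    # Refuse new work instead of queueing it: queueing here would stretch inter-token
    # latency for every conversation already in flight on this GPU.
    if _slots.locked():
        raise OverCapacity()
    async with _slots:
        async for token in llm_stream(prompt):
            yield token
```
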


Here is our detailed approach on autoscaling.


Latency Contribution (LLM)


With prefix + KV caching and concurrency control:

  • LLM TTFT (single request): ~30–40 ms
  • LLM TTFT (25 concurrent requests): ~120–140 ms (worst case)
  • Inter-token latency: ~2–5 ms per token (stable)


Effective contribution to end-to-end latency:

~140 ms, depending on concurrency

Voice AI Infra: Stage 1 + Stage 2 latency


Stage 3: Text-to-Speech (TTS) - Making Audio Streamable


In real-time voice AI systems, the per-token latency matters as much in TTS as it does in the LLM itself. The faster a sentence (or a meaningful fragment of it) is generated, the sooner it can be handed off to TTS. Any delay here directly compounds end-to-end conversational latency.


Most open-source TTS models are:

  • Sentence-based
  • Non-streamable by default


For example, models like Orpheus TTS impose an additional constraint: they require a minimum token context before producing stable audio. In practice:

  • Orpheus waits for ~28 tokens before beginning synthesis.
  • This corresponds to roughly ~500 ms of voiced audio; fewer tokens lead to unstable or incorrect outputs (or word-by-word output, which sounds robotic).
  • Internally, Orpheus combines an LLM frontend with a SNAC decoder, converting token sequences into numerical wave representations (numpy-based audio tensors).


Rather than waiting for full sentences or full audio tensors:

  • Tokens are accumulated in 28-token batches.
  • Audio is decoded slice-wise (e.g., a few frames at a time).
  • Playback begins immediately once the first slice is ready, while decoding continues incrementally.


If token generation takes ~1 ms per token, effective TTS latency becomes tightly coupled to LLM token latency and to sentence length (shorter sentences have lower latency than longer ones).


This creates a hard latency floor unless the system is explicitly designed to overlap text generation and audio synthesis.
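
A sketch of the token-gating behaviour described above (our reading of the Orpheus constraint, with a hypothetical `decode_audio()` wrapping the SNAC decoder): synthesis cannot begin until the first 28 tokens have accumulated, which is exactly the latency floor the streaming design below works around.

```python
MIN_TOKENS = 28   # Orpheus needs roughly this much context (~500 ms of speech) for stable audio

def gated_synthesis(token_stream, decode_audio):
    # `decode_audio` is a hypothetical wrapper around the SNAC decoder: tokens in, PCM out.
    buffer = []
    for token in token_stream:
        buffer.append(token)
        if len(buffer) >= MIN_TOKENS:
            yield decode_audio(buffer)    # the first yield here is gated by the 28-token minimum
            buffer = []
    if buffer:
        yield decode_audio(buffer)        # trailing tokens at the end of a sentence
```
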


Sentence-Aware Streaming


To eliminate this bottleneck, we introduce a sentence-aware streaming layer:

  • Text is streamed token-by-token from the LLM.
  • Sentence boundaries (., ?, !) are detected in real time.
  • As soon as a partial sentence is semantically coherent, it is dispatched to the TTS engine without waiting for the full response to complete.


This allows TTS to begin work while the LLM is still generating downstream tokens.
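
A minimal sketch of such a sentence splitter (the regex and the minimum-length guard are assumptions; the real coherence check is more involved): it sits between the LLM token stream and the TTS engine and emits a chunk as soon as a boundary appears.

```python
import re

SENTENCE_END = re.compile(r"[.?!]\s*$")   # ., ?, !; closing quotes/brackets omitted for brevity
MIN_CHARS = 12                            # crude guard against fragments like "Dr." or "e.g."

async def sentence_chunks(token_stream):
    # token_stream is the async iterator of text tokens coming out of the LLM.
    buffer = ""
    async for token in token_stream:
        buffer += token
        if len(buffer) >= MIN_CHARS and SENTENCE_END.search(buffer):
            yield buffer.strip()    # dispatch this sentence to TTS now...
            buffer = ""             # ...while the LLM keeps generating the next one
    if buffer.strip():
        yield buffer.strip()        # flush whatever remains when generation ends
```
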


Chunked Audio Decoding


On the audio side, we avoid decoding full waveform tensors upfront. Instead:

  • Audio is decoded in small slices (for example, ~3 frames at a time).
  • Playback begins immediately after the first slice is ready.
  • Decoding continues incrementally in parallel with playback.


This approach reduces:

  • Audio Time-To-First-Byte (TTFB) from ~120–150ms down to ~60ms
  • Perceived latency, which matters more than raw benchmarks


The result is continuous, human-like speech, rather than delayed, bursty playback that users subconsciously associate with machines.
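
A sketch of the slice-wise decode-and-play loop (the frame size and the `decode_frames` / `enqueue_playback` callables are assumptions): the first call to the playback sink is the audio TTFB, and later slices are decoded while earlier ones are already playing.

```python
import numpy as np

FRAMES_PER_SLICE = 3     # decode a few frames at a time instead of the whole utterance

def stream_slices(frame_codes: np.ndarray, decode_frames, enqueue_playback) -> None:
    # decode_frames: hypothetical SNAC-style decoder (codes in, PCM samples out).
    # enqueue_playback: hypothetical non-blocking sink feeding the audio device buffer,
    # so decoding of slice N+1 overlaps with playback of slice N.
    for start in range(0, len(frame_codes), FRAMES_PER_SLICE):
        pcm = decode_frames(frame_codes[start:start + FRAMES_PER_SLICE])
        enqueue_playback(pcm)    # the first call here is the ~60 ms audio TTFB
```
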


Latency Contribution (TTS)


Key constraints and mitigations:

  • Token generation time: ~1 ms/token → ~28 ms gating delay
  • Audio TTFB (after chunked decoding): ~60 ms


Because sentence-aware streaming overlaps LLM generation with TTS synthesis:

  • TTS no longer waits for full sentences
  • Audio decoding begins as soon as the first viable slice is ready


Effective contribution to end-to-end latency:

~100-120 ms, largely overlapped with LLM generation

Voice AI Infra: End-to-end latency
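
Putting the three stage budgets together, a back-of-envelope sum (the same numbers quoted above, no new measurements) shows how the pipeline stays under the ~400 ms threshold even before accounting for overlap:

```python
stt_ms = 120   # Stage 1: batched transcription under load
llm_ms = 140   # Stage 2: worst-case TTFT at 25 concurrent requests, prefix/KV cached
tts_ms = 110   # Stage 3: ~100-120 ms, largely overlapped with LLM generation

serial_worst_case = stt_ms + llm_ms + tts_ms    # 370 ms even with no overlap at all
# Sentence-aware streaming runs much of the TTS budget concurrently with LLM generation,
# so perceived end-to-end latency lands comfortably below the ~400 ms threshold.
```
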


Why We Avoid End-to-End Speech-to-Speech Black Boxes


Some voice AI systems bypass STT and TTS entirely by using end-to-end speech-to-speech models. While this approach is attractive on paper, with fewer components and a simpler pipeline, it introduces fundamental trade-offs that limit real-world usability.


The most important limitation is loss of controllability.


In end-to-end systems, quality, tone, reasoning, and behavior are all tightly coupled to a single model. If the output quality is subpar, the only option is to swap the entire system. If you need better reasoning, different speaking styles, or function calling, you are constrained by whatever capabilities that one model exposes.


This breaks down quickly for real voice AI agents.


Consider a human-like voice AI agent handling a banking or support workflow:

  • A user asks why their account balance changed.
  • A human agent would look up records, query systems, and respond with context.
  • An LLM can do the same, but only if it can call tools, query APIs, and reason over structured outputs.


End-to-end speech-to-speech models make this extremely difficult:

  • Tool calling becomes opaque or impossible.
  • You cannot selectively choose specialized function-calling models (e.g. GPT-OSS, Function Gemma).
  • Domain logic, retrieval, and reasoning are entangled with speech generation.
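
To make the tool-calling point concrete, here is the kind of function schema (a hypothetical balance-history tool, written in the common OpenAI-style JSON format) that a text LLM inside a modular pipeline can call, and that has no clean equivalent inside a speech-to-speech black box:

```python
balance_tool = {
    "type": "function",
    "function": {
        "name": "get_balance_history",           # hypothetical banking backend call
        "description": "Return recent balance changes for the caller's account.",
        "parameters": {
            "type": "object",
            "properties": {
                "account_id": {"type": "string"},
                "days": {"type": "integer", "description": "Lookback window in days"},
            },
            "required": ["account_id"],
        },
    },
}
# In the modular pipeline, the LLM emits a structured call to this tool, the backend
# executes it, and the result is reasoned over and then spoken through TTS. With an
# end-to-end speech model there is no clean point at which to intercept, execute,
# and re-inject that structured output.
```
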


Equally important is voice and persona control.


In production systems, voice directly shapes user trust, engagement, and perceived intelligence; it is part of the core product experience.

  • Some use cases require neutral, compliant tones.
  • Others require expressive or human-like personalities.
  • Companion or assistant experiences often demand specific vocal styles (for example WizardLM Uncensored).


Black-box speech models force you into a single voice, a single tone, and a single behavior profile.


The Bigger Picture: Voice AI as Infrastructure, Not a Demo


What we’ve described is not a single optimization or a clever trick; it is a systems philosophy.


At scale:

  • Latency is a product feature, not just a benchmark.
  • Concurrency is an architectural constraint, not a load test detail.
  • Quality emerges from coordination, not from any single model in isolation.


Real-time voice AI only works when STT, LLMs, and TTS are treated as a tightly coordinated pipeline where generation, decoding, and playback overlap by design, and every millisecond is accounted for.


You can build this yourself. But doing so requires deep, sustained systems work across:

  • Kernel-level performance tuning
  • Dynamic batching and scheduling
  • Streaming-first model execution
  • End-to-end latency budgeting


This is the work we’ve been doing for years.


The result is not a demo that works on stage, but production-grade voice AI infrastructure designed to scale, adapt, and feel human under real-world load.


Contact us to make your voice agents sound more human.

Find out what tailor-made inference looks like for you.