

Building a Voice Agent: STT, LLM and TTS Under 800ms

By Niall · 6 min read

A voice agent that lags feels broken. Here's how to keep the whole STT-to-TTS round-trip under 800ms.


A voice agent that works feels like magic. A voice agent that lags feels broken. The difference usually comes down to a single number: the time between someone finishing speaking and the agent starting to reply. Get that under about 800 milliseconds and conversation feels natural. Miss it, and every exchange feels like a bad phone line.

Here is how the pipeline fits together, and the engineering details that decide whether it feels human or frustrating.

The three-stage pipeline

At its core, a voice agent is three components in a row. Speech-to-text turns what the caller says into text. A language model decides how to respond. Text-to-speech turns that response back into audio. The art is in making those three stages feel like one seamless conversation rather than three separate steps.

A typical setup streams speech-to-text such as Scribe v2 Realtime or Deepgram into a language model, then streams the model's reply into low-latency text-to-speech such as Cartesia or ElevenLabs Flash. The word 'streams' is doing a lot of work there, and it is the key to hitting the latency budget.

Each stage is a specialised tool, and you will usually buy them from different providers rather than build them. The engineering value you add is in the orchestration: how the pieces hand off to each other, how you handle the messy timing of real speech, and how you keep the whole thing within budget. That orchestration layer, not any single model, is where most of the real engineering effort goes.

The 800ms budget

Natural conversation depends on keeping the full round-trip, from the moment the caller stops speaking to the moment the agent starts replying, under roughly 800 milliseconds. That is not much time to detect end of speech, run recognition, think, and begin speaking. Every stage has to be fast, and they have to overlap rather than wait politely for each other.

It helps to break the budget down. End-of-speech detection, final transcription, the model's first token, and the first audio out each take a slice, and they add up alarmingly fast. Treating the 800 milliseconds as a budget you allocate across stages, rather than a vague aspiration, is what keeps the design honest. Shave a little from each stage and the target is reachable; let any one of them sprawl and the whole conversation feels sluggish.
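
As a rough sketch of treating the 800 milliseconds as an allocation, the snippet below sums per-stage slices and reports what is left. The stage names follow the breakdown above; the millisecond figures and the helper names are illustrative assumptions, not measurements or provider numbers.

```python
# Illustrative per-stage allocation in milliseconds.
# These values are assumptions for the sketch, not benchmarks.
BUDGET_MS = 800

allocation_ms = {
    "end_of_speech_detection": 200,
    "final_transcription": 100,
    "llm_first_token": 300,
    "tts_first_audio": 150,
}


def headroom_ms(allocation: dict[str, int], budget_ms: int = BUDGET_MS) -> int:
    """Return the unallocated milliseconds (negative means over budget)."""
    return budget_ms - sum(allocation.values())


def over_budget_stages(measured: dict[str, int], allocation: dict[str, int]) -> list[str]:
    """List the stages whose measured latency exceeds their allocated slice."""
    return [stage for stage, ms in measured.items() if ms > allocation.get(stage, 0)]
```

Comparing measured per-stage latencies from real calls against the allocation shows which stage is sprawling, rather than only that the total was missed.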

Why streaming changes everything

If you run the stages one after another, waiting for each to finish, the delays stack up and you blow the budget. Streaming lets them overlap. Transcription emits partial results while the person is still talking. The language model can begin forming a response before the final word lands. Text-to-speech can start speaking the first part of an answer while the rest is still being generated.

This overlap is what makes the difference between a budget you can hit and one you cannot. It is also what makes a voice agent meaningfully harder to build than a text chatbot, where a second of delay is barely noticed.
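
The arithmetic behind that overlap can be sketched with a deliberately simplified model: run in sequence, each stage costs its full duration; streamed, each stage starts as soon as the previous one emits its first output, so only the time-to-first-output values add up. The Stage type and all timings below are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    first_output_ms: int  # time until the stage emits its first usable output
    total_ms: int         # time until the stage has completely finished


def sequential_latency_ms(stages: list[Stage]) -> int:
    """Each stage waits for the previous one to finish entirely."""
    return sum(s.total_ms for s in stages)


def streamed_latency_ms(stages: list[Stage]) -> int:
    """Each stage starts on the previous stage's first output (simplified model)."""
    return sum(s.first_output_ms for s in stages)


# Assumed timings, measured from the moment the caller stops speaking.
stages = [
    Stage("stt_finalise", first_output_ms=150, total_ms=150),
    Stage("llm", first_output_ms=300, total_ms=1500),
    Stage("tts", first_output_ms=150, total_ms=2000),
]
```

With these assumed numbers the sequential path takes 3650 ms before the caller hears anything, while the streamed path takes 600 ms.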

Streaming also changes how you handle the model. Rather than waiting for a complete answer, you take the first sentence and send it to be spoken immediately while the rest is still being written. The caller hears a reply beginning almost at once, which is most of what makes an exchange feel responsive.
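
A minimal sketch of that hand-off, assuming the model's reply arrives as an iterable of text tokens: buffer tokens until a sentence ends, then yield the sentence so it can go to text-to-speech while generation continues. The function name is hypothetical, and the punctuation check is naive (abbreviations such as "Dr." or decimals would split early).

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Naive boundary check: flush whenever the buffer ends in terminal punctuation.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends without terminal punctuation.
    if buffer.strip():
        yield buffer.strip()
```

Because this is a generator, the first sentence is available to the speech stage as soon as its final token arrives, not when the whole reply is done.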

A text chatbot can pause to think and no one minds. In voice, a pause reads as a problem. The engineering challenge is not just being correct; it is being correct fast enough that the silence never feels wrong.

The details that make or break it

Beyond the three main stages, a handful of details separate a usable voice agent from an irritating one. Each sounds minor and is anything but.

  • End-of-speech detection: knowing quickly and accurately when the person has actually finished, not just paused.
  • Partial results: using interim transcripts so work starts before the caller stops.
  • Barge-in handling: letting the caller interrupt, and having the agent stop talking and listen at once.
  • Guardrails: keeping the agent on topic, within policy, and safe when it is unsure.

Getting end-of-speech detection right is a balancing act in itself. Cut in too early and you talk over someone who was merely pausing for breath; wait too long and the conversation drags. There is no universal setting, which is why tuning against real calls, not just clean test audio, matters so much.
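
One common way to frame that trade-off is a silence timer over per-frame voice-activity flags. The sketch below assumes some upstream detector supplies a boolean per audio frame; the class name, frame size, and 500 ms default are illustrative assumptions and exactly the kind of values that need tuning against real calls.

```python
class Endpointer:
    """Declare end of turn after enough continuous silence following speech."""

    def __init__(self, frame_ms: int = 20, silence_threshold_ms: int = 500):
        self.frame_ms = frame_ms
        self.silence_threshold_ms = silence_threshold_ms  # assumed default; tune it
        self._silence_ms = 0
        self._heard_speech = False

    def push(self, is_speech: bool) -> bool:
        """Feed one frame's voice-activity flag; return True when the turn is judged over."""
        if is_speech:
            self._heard_speech = True
            self._silence_ms = 0  # any speech resets the silence timer
            return False
        if not self._heard_speech:
            return False  # ignore silence before the caller has said anything
        self._silence_ms += self.frame_ms
        if self._silence_ms >= self.silence_threshold_ms:
            # Reset so the same instance can detect the end of the next turn.
            self._heard_speech = False
            self._silence_ms = 0
            return True
        return False
```

A lower threshold replies sooner but cuts in on mid-sentence pauses; a higher one is safer but eats directly into the latency budget.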

Barge-in is especially telling. In real conversation, people interrupt constantly. An agent that ploughs on talking over someone feels robotic and rude, so detecting interruption and yielding gracefully is essential, not a nice-to-have.
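
A minimal sketch of yielding on interruption, using Python's asyncio: playback runs as a task and is cancelled the moment an event signals that the caller has started speaking. The speak function is a stand-in for real audio playback, and the event is assumed to be set by whatever voice-activity detection the pipeline uses; both are hypothetical.

```python
import asyncio


async def speak(chunks, played):
    """Stand-in for TTS playback: 'plays' one chunk at a time."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(0.05)  # pretend each chunk takes 50 ms to play


async def speak_with_barge_in(chunks, caller_spoke: asyncio.Event):
    """Play chunks until finished or until the caller interrupts."""
    played = []
    playback = asyncio.create_task(speak(chunks, played))
    interrupt = asyncio.create_task(caller_spoke.wait())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = interrupt in done and playback not in done
    # Cancel whichever task is still running: playback on barge-in,
    # or the interrupt watcher when playback finished normally.
    for task in (playback, interrupt):
        if not task.done():
            task.cancel()
    await asyncio.gather(playback, interrupt, return_exceptions=True)
    return played, interrupted
```

In a real pipeline the same cancellation would also need to stop any in-flight model generation and flush audio already queued for output, otherwise the agent keeps talking for a moment after it has "stopped".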

Where voice agents earn their keep

When the pipeline is right, the use cases are compelling, particularly for high-volume, structured conversations that follow a recognisable shape.

  • Support: answering common questions and resolving routine issues by phone.
  • Scheduling: booking, rescheduling and confirming appointments.
  • Intake: gathering information at the start of a process, accurately and patiently.

The pattern across all three is a bounded conversation with a clear goal, plus a clean way to escalate. The agent should know its limits and hand off to a person the moment a call goes beyond them. An agent that fails gracefully earns far more trust than one that bluffs through a question it cannot answer.
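
The hand-off rule can be as simple as a few explicit checks. The sketch below assumes the pipeline already produces an intent label, a confidence score, and a count of failed turns; the labels, thresholds, and function name are hypothetical.

```python
# Hypothetical intent labels matching the three use cases above.
SUPPORTED_INTENTS = {"support_faq", "scheduling", "intake"}


def should_hand_off(
    intent: str,
    confidence: float,
    failed_turns: int,
    min_confidence: float = 0.7,   # assumed threshold
    max_failed_turns: int = 2,     # assumed threshold
) -> bool:
    """Return True when the call should be escalated to a person."""
    if intent not in SUPPORTED_INTENTS:
        return True  # outside the agent's bounded scope
    if confidence < min_confidence:
        return True  # unsure: escalate rather than bluff
    return failed_turns >= max_failed_turns  # repeated misses on this call
```

Keeping the rule explicit makes it easy to audit why a call was, or was not, escalated.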

A good voice agent is a real-time systems problem as much as an AI problem, and the latency budget is unforgiving. Designing that pipeline, with the guardrails and human handoff that keep it trustworthy, is precisely the kind of production AI agent work we take on.

