How real-time voice translation works (and why latency matters)

May 12, 2026·6 min read

A plain-English look at the pipeline behind live translated calls (speech recognition, machine translation, and voice synthesis) and why sub-second latency is the whole game.

When two people who speak different languages talk through SimulSpeak, it feels like magic: you speak, and a moment later the other person hears you in their language, in a voice that sounds like yours. Under the hood, that experience is a carefully tuned pipeline of three stages, each optimised for one thing above all: speed.

The three stages

Every spoken sentence travels through speech recognition, machine translation, and speech synthesis. First, automatic speech recognition turns your audio into text as you speak. Next, machine translation converts that text into the target language. Finally, text-to-speech voices the translation, and with voice cloning, it does so in a voice resembling yours.

Speech-to-text (ASR): your words become text, streamed word by word rather than sentence by sentence.
Machine translation: the recognised text is translated in context, not word for word.
Text-to-speech (TTS): the translated text is spoken aloud in the other language.
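To make the hand-off between the three stages concrete, here is a minimal sketch in Python. It is an illustration only, not SimulSpeak's implementation: the names Pipeline, fake_asr, fake_mt, fake_tts and the PHRASES table are hypothetical, and each stage is a toy stand-in (the "audio" is just UTF-8 text) so the example can run without any speech models.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Pipeline:
    """Chains the three stages: audio -> text -> translated text -> audio."""
    recognise: Callable[[bytes], str]    # speech-to-text (ASR)
    translate: Callable[[str], str]      # machine translation
    synthesise: Callable[[str], bytes]   # text-to-speech (TTS)

    def run(self, audio: bytes) -> bytes:
        text = self.recognise(audio)
        translated = self.translate(text)
        return self.synthesise(translated)


# Toy stand-ins (assumptions for illustration, not real models).
def fake_asr(audio: bytes) -> str:
    # Pretend the audio bytes already contain the transcript.
    return audio.decode("utf-8")


PHRASES: Dict[str, str] = {"good morning": "buenos días"}


def fake_mt(text: str) -> str:
    # Look up whole phrases; unknown text passes through unchanged.
    return PHRASES.get(text.lower(), text)


def fake_tts(text: str) -> bytes:
    # Pretend that encoding the text is the same as voicing it.
    return text.encode("utf-8")


pipeline = Pipeline(fake_asr, fake_mt, fake_tts)
result = pipeline.run(b"Good morning")
```

Keeping each stage behind a plain function makes it easy to swap in a real recogniser, translator, or synthesiser without touching the other two.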

Why latency is everything

A conversation only feels natural if the delay between speaking and being understood stays low. Past roughly one second of added latency, people begin to talk over each other and the rhythm of the call falls apart. That is why real-time systems stream each stage: recognition emits words as it hears them, translation starts before a sentence finishes, and synthesis begins as soon as the first translated words are ready.
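The streaming idea can be sketched with Python generators, where each stage pulls from the one before it and passes results on as soon as it has them. This is a simplified illustration under stated assumptions: stream_asr, stream_mt, stream_tts and the events log are hypothetical names, "translation" is faked by upper-casing, and a fixed two-word window stands in for whatever segmentation a real translator would use.

```python
from typing import Iterable, Iterator, List

# Ordered log of what happened, so the interleaving of stages is visible.
events: List[str] = []


def stream_asr(audio_chunks: Iterable[str]) -> Iterator[str]:
    """Emit each word as soon as it is 'heard'."""
    for chunk in audio_chunks:
        events.append(f"heard:{chunk}")
        yield chunk


def stream_mt(words: Iterable[str], window: int = 2) -> Iterator[str]:
    """Translate small groups of words without waiting for the full sentence.

    Upper-casing is a placeholder for real translation.
    """
    buffer: List[str] = []
    for word in words:
        buffer.append(word)
        if len(buffer) == window:
            yield " ".join(buffer).upper()
            buffer = []
    if buffer:  # flush any words left over at the end of the utterance
        yield " ".join(buffer).upper()


def stream_tts(segments: Iterable[str]) -> Iterator[str]:
    """'Speak' each translated segment as it arrives."""
    for segment in segments:
        events.append(f"spoke:{segment}")
        yield segment


words_in = ["where", "is", "the", "station"]
spoken = list(stream_tts(stream_mt(stream_asr(words_in))))
```

In the resulting events log, the first segment is spoken before the third word has even been heard, which is the property that keeps added delay low. A batch pipeline would log all four heard entries before any spoke entry.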

SimulSpeak targets sub-second added latency measured on live calls, which is what lets a translated call feel like an ordinary one rather than a walkie-talkie exchange.
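One way to reason about a sub-second target is as a budget shared by every stage. The sketch below adds up the time each stage takes to produce its first output and checks the total against a one-second ceiling. The function names and the per-stage figures are placeholders invented for illustration; they are not SimulSpeak measurements.

```python
from typing import Mapping

# The article's rough threshold: about one second of added latency.
BUDGET_MS = 1000.0


def added_latency_ms(first_output_delay_ms: Mapping[str, float]) -> float:
    """Sum the time each stage needs before it emits its first output."""
    return sum(first_output_delay_ms.values())


def within_budget(first_output_delay_ms: Mapping[str, float],
                  budget_ms: float = BUDGET_MS) -> bool:
    """True if the combined delay stays under the budget."""
    return added_latency_ms(first_output_delay_ms) < budget_ms


# Placeholder numbers for illustration only, not real measurements.
example = {"asr": 300.0, "mt": 200.0, "tts": 250.0, "network": 150.0}

total = added_latency_ms(example)
ok = within_budget(example)
```

Framing latency this way shows why every stage has to stream: if any single stage waits for a full sentence, it can consume the whole budget on its own.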

Keeping your own voice

Hearing a translation in a flat, robotic voice breaks the human connection. Cloning the speaker's voice lets the translated audio carry the same tone and identity, so the person on the other end feels like they are talking to you, not to a machine.

Put together, these pieces turn three separate AI systems into a single, fluid conversation across any of 24 languages.

