Speech-to-text (ASR)

Automatic speech recognition that converts spoken audio into written text.

Speech-to-text, or automatic speech recognition (ASR), turns an audio stream into transcribed words. In a speech translation pipeline, ASR is the first stage: the spoken source language must be recognised as text before it can be translated.

Streaming ASR emits words as they are heard rather than waiting for the end of a sentence, which is essential for low-latency, real-time translation.
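
The difference between batch and streaming recognition can be illustrated with a minimal Python sketch. It is a toy under stated assumptions, not a real recogniser: each "audio chunk" is simply a string of text, and the names `fake_recognise`, `batch_asr`, and `streaming_asr` are hypothetical helpers invented for this illustration. The point is only the control flow: the batch version returns nothing until all audio has been consumed, while the streaming version yields each word as soon as its chunk has been processed.

```python
from typing import Iterable, Iterator, List


def fake_recognise(chunk: str) -> List[str]:
    """Hypothetical stand-in for an acoustic model.

    Assumption: a 'chunk of audio' is represented as plain text, so
    recognition is just splitting on whitespace.
    """
    return chunk.split()


def batch_asr(chunks: Iterable[str]) -> List[str]:
    """Consume all audio first, then return the whole transcript at once."""
    words: List[str] = []
    for chunk in chunks:
        words.extend(fake_recognise(chunk))
    return words


def streaming_asr(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each word as soon as the chunk containing it is processed.

    Because this is a generator, later chunks are not read until the
    caller asks for more words.
    """
    for chunk in chunks:
        for word in fake_recognise(chunk):
            yield word


if __name__ == "__main__":
    audio = ["hello there", "how are you"]

    # Batch: nothing is available until every chunk has been seen.
    print(batch_asr(audio))

    # Streaming: words arrive one at a time and could be handed to a
    # translation stage immediately.
    for word in streaming_asr(audio):
        print(word)
```

Both functions produce the same words in the same order; what differs is when the words become available to a downstream stage such as translation. A real streaming recogniser would also have to handle words that span chunk boundaries and may revise earlier hypotheses, which this sketch does not attempt.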