Audio & Speech
Voice is one of the most natural interfaces there is, and it’s now practical to build. Audio work has three building blocks — transcription, synthesis, and understanding — which combine into voice agents.
Speech-to-text (STT)
Speech-to-text, also called ASR (automatic speech recognition) or just transcription, converts spoken audio into written text. Modern STT models are accurate, multilingual, and cheap.
Two modes:
- Batch — transcribe a complete recording. For meeting notes, call analytics, captioning archives.
- Streaming — transcribe as the person speaks, emitting partial results. Required for live captions and conversational agents.
Accuracy is not uniform. It degrades with background noise, strong accents, overlapping speakers, and domain-specific vocabulary (product names, jargon, codes). Diarization — labeling who spoke — is a separate, harder problem. Plan for imperfect transcripts.
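One common way to put a number on transcript quality is word error rate (WER): the word-level edit distance between a reference transcript and the STT output, divided by the number of reference words. The sketch below is a minimal illustration, assuming lowercase whitespace tokenization; the function name `word_error_rate` and the sample sentences are hypothetical, and real evaluations usually normalize punctuation and numbers first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one misheard word out of four.
wer = word_error_rate("reset my router password", "reset my rooter password")
print(wer)
```

Measuring WER separately on noisy, accented, and jargon-heavy samples can show where a given model degrades, rather than relying on a single headline figure.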
Text-to-speech (TTS)
Text-to-speech synthesizes spoken audio from text. Modern TTS is close to natural, supports many voices and languages, and can stream audio as it’s generated, which is essential for keeping perceived latency low.
The dimensions that matter: naturalness, latency (especially time to the first audio chunk), and voice selection.
Audio understanding
Beyond transcription, newer models reason about audio directly: tone and emotion, non-speech events (a siren, applause), music, who is speaking. This keeps information that a plain transcript throws away. Useful for call-quality analysis, accessibility, and richer voice agents.
Voice agents
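A voice agent lets a user talk to an LLM-powered system. The classic design is a pipeline: STT → LLM → TTS. The user's speech is transcribed, the transcript goes to the LLM, and the LLM's reply is synthesized back into audio.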
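As a rough illustration of that pipeline shape, the sketch below wires three stages together for a single conversational turn. Everything here is an assumption for illustration: `run_turn` and the `fake_*` stand-ins are hypothetical names, and a real agent would call actual STT, LLM, and TTS services in their place.

```python
from typing import Callable


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> bytes:
    """Run one turn of the pipeline: audio in, audio out."""
    transcript = stt(audio)  # speech-to-text
    reply = llm(transcript)  # text in, text out
    return tts(reply)        # text-to-speech


# Hypothetical stand-in stages so the sketch runs without any service.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


print(run_turn(b"hello", fake_stt, fake_llm, fake_tts))
```

Passing the stages in as functions keeps each one swappable, and the intermediate `transcript` and `reply` values are plain text that can be logged or inspected.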
The latency budget
The hard part of a voice agent is latency. A natural conversation needs a response within a few hundred milliseconds, and every stage spends some of that budget: capturing audio, STT, the LLM (its time to first token), TTS, and playback. They add up fast.
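The arithmetic is simple enough to sketch. In the example below, every number is an illustrative placeholder and not a measurement of any real system, and the 500 ms budget is likewise an assumption; `budget_report` is a hypothetical helper name.

```python
from typing import Dict, Tuple


def budget_report(stage_ms: Dict[str, float], budget_ms: float) -> Tuple[float, float]:
    """Return (total latency, overrun); overrun is 0 when the total fits."""
    total = sum(stage_ms.values())
    return total, max(0.0, total - budget_ms)


# Placeholder per-stage latencies in milliseconds (assumed, not measured).
stages = {
    "audio_capture": 40,
    "stt": 150,
    "llm_first_token": 300,
    "tts_first_chunk": 120,
    "playback": 30,
}

total, overrun = budget_report(stages, budget_ms=500)
print(f"total={total} ms, over budget by {overrun} ms")
```

With these placeholder values the stages sum to 640 ms, 140 ms over the assumed budget, even though no single stage looks slow on its own.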
Tactics: stream every stage (don’t wait for a full transcript before starting the LLM; don’t wait for the full response before starting TTS); handle barge-in so a user can interrupt; and get endpointing right — detecting when the user has actually finished speaking.
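Endpointing can be sketched with a simple silence-based rule: once speech has been heard, treat the turn as finished after a run of consecutive quiet frames. This is a minimal sketch under stated assumptions, not a production detector. The function name, the energy threshold, and the frame count are all hypothetical, and real systems often use a trained voice-activity model instead of a fixed energy threshold.

```python
from typing import Iterable, Optional


def find_endpoint(
    frame_energies: Iterable[float],
    silence_threshold: float = 0.01,  # assumed energy cutoff for "quiet"
    min_silent_frames: int = 25,      # assumed: about 500 ms at 20 ms frames
) -> Optional[int]:
    """Return the index of the frame where the user is judged to have
    finished speaking, or None if no endpoint is found."""
    heard_speech = False
    silent_run = 0
    for i, energy in enumerate(frame_energies):
        if energy >= silence_threshold:
            heard_speech = True
            silent_run = 0  # any speech resets the silence counter
        elif heard_speech:
            silent_run += 1
            if silent_run >= min_silent_frames:
                return i
    return None


# Hypothetical input: leading silence, speech, a short pause, more speech,
# then a long silence that should trigger the endpoint.
energies = [0.0] * 5 + [0.5] * 10 + [0.0] * 3 + [0.4] * 4 + [0.0] * 30
print(find_endpoint(energies))
```

The `min_silent_frames` value is the trade-off knob: set it too low and the agent cuts users off during a mid-sentence pause, set it too high and every turn pays extra latency before the LLM starts.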
Speech-to-speech models
Newer speech-to-speech models take audio in and produce audio out directly, skipping the text round-trip. They cut latency and preserve tone and emotion that transcription discards, at the cost of less visibility and control (no transcript to inspect, log, or guardrail mid-pipeline). The pipeline approach remains easier to debug and govern.
Failure modes
Errors compound: an STT mistake becomes wrong input to the LLM, which answers confidently about the wrong thing. Voice also removes the ability to proofread, since a user can’t see a misheard word. And voice recordings are sensitive personal data; handle them as covered in Data & Privacy.
Key takeaways
Audio work has three blocks: speech-to-text (batch or streaming, accuracy varies with noise and accents), text-to-speech (natural and streamable; voice cloning needs consent), and audio understanding. A voice agent chains STT → LLM → TTS, and its central challenge is the latency budget: stream every stage and handle interruptions. Speech-to-speech models cut latency and keep tone but sacrifice control. STT errors compound downstream, so design for imperfect transcripts.