Audio formats

Supported input and output encodings.

The ASR, STS, and TTS APIs all speak JSON over WebSocket, and audio payloads are Base64-encoded inside event bodies.
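
As a minimal sketch of that framing, the Python snippet below wraps raw audio bytes in a JSON event with a Base64 payload and unpacks it again. The event type input_audio_buffer.append is the one named under Chunk size; the payload field name "audio" and both helper names are assumptions made for illustration, so check the event reference for the actual schema.

```python
import base64
import json


def build_append_event(raw_audio: bytes) -> str:
    """Serialize raw audio bytes as a JSON text frame with a Base64 payload."""
    event = {
        "type": "input_audio_buffer.append",
        # "audio" is a hypothetical field name, not confirmed by this page.
        "audio": base64.b64encode(raw_audio).decode("ascii"),
    }
    return json.dumps(event)


def extract_audio(message: str) -> bytes:
    """Recover the raw audio bytes from a JSON event built as above."""
    return base64.b64decode(json.loads(message)["audio"])


frame = build_append_event(b"\x00\x01\x02\x03")
```

Base64 emits 4 characters for every 3 input bytes, so the encoded payload is about one third larger than the raw audio.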

Input audio (ASR / STS)

Format     Description
pcm16      16-bit signed little-endian PCM
float32    32-bit IEEE float PCM
twilio     μ-law 8 kHz mono (Twilio Media Streams)
ogg/opus   Ogg-wrapped Opus
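
To illustrate the pcm16 layout, the sketch below packs floating-point samples into 16-bit signed little-endian bytes using Python's standard struct module. The function name is hypothetical, and scaling by 32767 with clipping to [-1.0, 1.0] is one common convention assumed here, not something this page specifies.

```python
import struct


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit signed little-endian PCM bytes."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip out-of-range samples
        ints.append(int(round(s * 32767)))  # assumed scaling convention
    # "<" selects little-endian, "h" is a signed 16-bit integer
    return struct.pack(f"<{len(ints)}h", *ints)


pcm = float_to_pcm16([0.0, 1.0, -1.0])
```

Each pcm16 sample occupies 2 bytes, so the three samples above produce 6 bytes; the same samples in float32 would take 12.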

Input audio defaults to a 24,000 Hz sample rate and a single (mono) channel. You can change both via the session update event.

Output audio (STS / TTS)

Format     Description
pcm16      16-bit signed little-endian PCM
float32    32-bit IEEE float PCM
twilio     μ-law 8 kHz mono (Twilio Media Streams)
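
The twilio format carries one μ-law byte per sample. If downstream code needs linear PCM, each byte can be expanded to a 16-bit sample. The sketch below uses the standard G.711 μ-law expansion; the function names are illustrative, and nothing in it is specific to this API.

```python
import struct

BIAS = 0x84  # 132, the bias used by G.711 mu-law companding


def ulaw_byte_to_linear(u: int) -> int:
    """Expand one mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF  # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def ulaw_to_pcm16(data: bytes) -> bytes:
    """Decode a mu-law byte string to 16-bit signed little-endian PCM."""
    samples = [ulaw_byte_to_linear(b) for b in data]
    return struct.pack(f"<{len(samples)}h", *samples)
```

At 8 kHz, 20 ms of audio is 160 μ-law bytes, which decode to 320 bytes of 16-bit PCM.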

Chunk size

Each input_audio_buffer.append event can carry up to 1 MiB (approximately 10 seconds at the default sample rate). For most real-time applications, 20–40 ms chunks work well.
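
The chunk sizes above translate into byte counts with simple arithmetic: bytes = sample rate × duration × channels × bytes per sample. The sketch below assumes raw pcm16 at the default 24,000 Hz mono unless told otherwise; the helper names are hypothetical.

```python
def chunk_size_bytes(duration_ms, sample_rate=24_000, channels=1, bytes_per_sample=2):
    """Bytes of raw audio covering duration_ms (defaults: 24 kHz mono pcm16)."""
    frames = sample_rate * duration_ms // 1000
    return frames * channels * bytes_per_sample


def split_audio(raw, duration_ms=20, **fmt):
    """Yield successive chunks of raw audio; the last one may be shorter."""
    size = chunk_size_bytes(duration_ms, **fmt)
    for start in range(0, len(raw), size):
        yield raw[start:start + size]


# One second of silent pcm16 audio at 24 kHz mono, split into 20 ms chunks.
chunks = list(split_audio(bytes(48_000)))
```

With these defaults a 20 ms chunk is 960 bytes and a 40 ms chunk is 1,920 bytes before Base64 encoding. This page does not say whether the 1 MiB limit applies to the raw or the encoded bytes; chunks of this size are far below it either way.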