Audio formats
Supported input and output encodings.
The ASR, STS, and TTS APIs all speak JSON over WebSocket, and audio payloads are Base64-encoded inside event bodies.
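As a rough illustration of Base64-encoded audio inside a JSON event body, the following Python sketch wraps raw audio bytes in a text frame and unwraps them again. The event name input_audio_buffer.append comes from this page; the "type" and "audio" field names and both helper functions are assumptions made for the example, so check the event reference for the actual schema.

```python
import base64
import json


def make_append_event(audio_bytes: bytes) -> str:
    """Wrap raw audio bytes in a JSON text frame.

    The "type" and "audio" keys are assumed for illustration only.
    """
    event = {
        "type": "input_audio_buffer.append",  # event name from this page
        "audio": base64.b64encode(audio_bytes).decode("ascii"),  # assumed field name
    }
    return json.dumps(event)


def read_audio(message: str, field: str = "audio") -> bytes:
    """Decode the Base64 payload from a JSON event (field name assumed)."""
    return base64.b64decode(json.loads(message)[field])
```

Base64 expands a payload by roughly one third, which is worth keeping in mind when sizing messages.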
Input audio (ASR / STS)
Input audio defaults to a 24,000 Hz sample rate and mono (1 channel). You can
change both via the session update event.
Output audio (STS / TTS)
Chunk size
Each input_audio_buffer.append event can carry up to 1 MiB
(approximately 10 seconds at the default sample rate). For most
real-time applications, 20–40 ms chunks work well.
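To make the 20 to 40 ms guidance concrete, here is a small Python sketch that converts a chunk duration into a byte count and slices a buffer accordingly. It assumes 16-bit PCM (2 bytes per sample), which this page does not state; the constant and function names are hypothetical, so adjust BYTES_PER_SAMPLE to match the encoding you actually send.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 24_000  # default sample rate from this page
CHANNELS = 1             # default channel count from this page
BYTES_PER_SAMPLE = 2     # assumption: 16-bit PCM, not confirmed by this page


def chunk_bytes(
    duration_ms: int,
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    channels: int = CHANNELS,
    bytes_per_sample: int = BYTES_PER_SAMPLE,
) -> int:
    """Return the number of raw audio bytes in a chunk of the given duration."""
    samples = sample_rate_hz * duration_ms // 1000
    return samples * channels * bytes_per_sample


def split_audio(audio: bytes, duration_ms: int = 20) -> Iterator[bytes]:
    """Yield consecutive chunks of raw audio; the last chunk may be shorter."""
    size = chunk_bytes(duration_ms)
    for start in range(0, len(audio), size):
        yield audio[start:start + size]
```

Under these assumptions a 20 ms chunk is 960 bytes and a 40 ms chunk is 1,920 bytes before Base64 encoding.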