Audio formats | Kotoba Technologies

The ASR, STS, and TTS APIs all speak JSON over WebSocket, and audio payloads are Base64-encoded inside event bodies.
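
As a rough sketch of what this looks like on the wire, the snippet below wraps raw audio bytes in a JSON text frame and unwraps them again. The event name `input_audio_buffer.append` comes from the Chunk size section of this page. The `type` and `audio` field names are assumptions made for illustration, so check the API reference for the actual event schema.

```python
import base64
import json


def build_append_event(audio_bytes: bytes) -> str:
    """Wrap raw audio bytes in a JSON text frame.

    The "type" and "audio" field names are assumed, not confirmed by this page.
    """
    payload = base64.b64encode(audio_bytes).decode("ascii")
    return json.dumps({"type": "input_audio_buffer.append", "audio": payload})


def extract_audio(event_json: str) -> bytes:
    """Decode the Base64 audio payload from an event with the same assumed shape."""
    event = json.loads(event_json)
    return base64.b64decode(event["audio"])
```

The string returned by `build_append_event` would be sent as a single WebSocket text message. `extract_audio` shows the reverse step for output audio, again assuming the same field name.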

Input audio (ASR / STS)

Format	Description
`pcm16`	16-bit signed little-endian PCM
`float32`	32-bit IEEE float PCM
`twilio`	μ-law 8 kHz mono (Twilio Media Streams)
`ogg/opus`	Ogg-wrapped Opus

Input audio defaults to a 24,000 Hz sample rate and mono (1 channel). You can change both via the session update event.
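
To make the two PCM layouts in the table above concrete, here is a minimal sketch that packs floating-point samples into `pcm16` and `float32` byte strings using only the standard library. Scaling by 32767 and clipping to [-1.0, 1.0] is a common convention rather than something this page specifies, and little-endian byte order for `float32` is an assumption (the table only states it for `pcm16`).

```python
import struct


def floats_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to 16-bit signed little-endian PCM bytes."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip out-of-range samples
        ints.append(int(round(s * 32767)))  # assumed scaling convention
    return struct.pack(f"<{len(ints)}h", *ints)


def floats_to_float32(samples):
    """Pack floats as 32-bit IEEE 754 values (little-endian assumed here)."""
    return struct.pack(f"<{len(samples)}f", *samples)
```

A `pcm16` sample occupies 2 bytes and a `float32` sample occupies 4, so the same audio is twice as large in `float32`.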

Output audio (STS / TTS)

Format	Description
`pcm16`	16-bit signed little-endian PCM
`float32`	32-bit IEEE float PCM
`twilio`	μ-law 8 kHz mono (Twilio Media Streams)

Chunk size

Each `input_audio_buffer.append` event can carry up to 1 MiB (approximately 10 seconds at the default sample rate; the exact duration depends on the audio format). For most real-time applications, much smaller chunks of 20–40 ms work well.
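
The chunk sizes above translate into byte counts as follows. This is a small sketch assuming `pcm16` input (2 bytes per sample) at the default 24,000 Hz mono. The helper names are hypothetical and not part of the API.

```python
def bytes_per_chunk(sample_rate_hz: int, channels: int,
                    bytes_per_sample: int, chunk_ms: int) -> int:
    """Number of raw audio bytes in one chunk of the given duration."""
    frames = sample_rate_hz * chunk_ms // 1000
    return frames * channels * bytes_per_sample


def split_into_chunks(audio: bytes, chunk_bytes: int):
    """Yield consecutive slices of at most chunk_bytes; the last may be shorter."""
    for start in range(0, len(audio), chunk_bytes):
        yield audio[start:start + chunk_bytes]


# Defaults from this page: 24,000 Hz, mono; pcm16 uses 2 bytes per sample.
CHUNK_20_MS = bytes_per_chunk(24_000, 1, 2, 20)  # 960 bytes
CHUNK_40_MS = bytes_per_chunk(24_000, 1, 2, 40)  # 1920 bytes
```

Base64 expands data by a factor of 4/3, so a 960-byte chunk becomes 1,280 characters inside the event body. For reference, raw `pcm16` at the defaults is 48,000 bytes per second and `float32` is 96,000 bytes per second. This page does not say whether the 1 MiB limit counts raw or encoded bytes.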