Text to Speech
Synthesize natural speech from text, with streaming output.
Kotoba’s Text-to-Speech (TTS) synthesizes audio from text and streams it back as soon as the model produces it, so you can pipe it straight to a speaker, a WebRTC track, or an LLM-driven agent without waiting for the full utterance.
Supported languages: English (en), Japanese (ja), Korean (ko), Chinese (zh), and Spanish (es).
Three call shapes
- One-shot: pass a full string in, get a complete waveform back.
- Streaming: send the text in one frame and read audio chunks as the server produces them (output streaming).
- Async: the same surface, with AsyncKotobaClient for running many requests concurrently.
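The difference between the three shapes is mostly in how the caller consumes audio, which the following minimal sketch tries to illustrate. It does not import the Kotoba SDK: StubTTS and AsyncStubTTS are hypothetical stand-ins that emit silence, and the chunk size, the play sink, and the exact method signatures are assumptions made for illustration. Only the method names synthesize and stream come from this page, so check the Python SDK page for the real arguments and return types.

```python
import asyncio
from typing import AsyncIterator, Iterator, List

# Arbitrary assumption: 100 ms of 16 kHz, 16-bit mono PCM per chunk.
CHUNK_BYTES = 3200


class StubTTS:
    """Hypothetical stand-in for `client.tts`; returns silence, not speech."""

    def stream(self, text: str) -> Iterator[bytes]:
        # Pretend the server emits one audio chunk per word of input.
        for _ in text.split():
            yield b"\x00" * CHUNK_BYTES

    def synthesize(self, text: str) -> bytes:
        # One-shot: the caller sees audio only after every chunk exists.
        return b"".join(self.stream(text))


class AsyncStubTTS:
    """Hypothetical async stand-in, shaped like an async client might be."""

    async def stream(self, text: str) -> AsyncIterator[bytes]:
        for _ in text.split():
            await asyncio.sleep(0)  # yield control, as a network read would
            yield b"\x00" * CHUNK_BYTES


def play(chunk: bytes) -> int:
    """Placeholder sink; a real one would write to a speaker or WebRTC track."""
    return len(chunk)


tts = StubTTS()
text = "hello from the streaming demo"

# One-shot: a single blocking call that returns one complete buffer.
waveform = tts.synthesize(text)

# Streaming: handle each chunk as soon as it arrives.
streamed_bytes = sum(play(chunk) for chunk in tts.stream(text))


async def speak_many(texts: List[str]) -> List[int]:
    """Consume several streams concurrently; return bytes received per text."""

    async def consume(t: str) -> int:
        return sum([play(c) async for c in AsyncStubTTS().stream(t)])

    return await asyncio.gather(*(consume(t) for t in texts))


per_text = asyncio.run(speak_many(["one two", "three four five"]))
```

In the streaming and async cases the first chunk can be played while later chunks are still being produced, which is why those shapes suit live playback. One-shot is simpler when the full waveform is needed anyway, for example to save it to a file.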
Available Japanese speakers: ja-man-m02-azawa (male) and ja-woman-f04-me (female).
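If you want to catch an unsupported language code or a mistyped speaker ID before making a request, a small client-side check along these lines may help. The language codes and speaker IDs are copied from this page. The check_request helper is hypothetical and not part of the SDK, and it deliberately does not validate speakers for other languages because none are listed here.

```python
from typing import Optional

# Values copied from this page; the helper itself is illustrative, not SDK code.
SUPPORTED_LANGUAGES = {"en", "ja", "ko", "zh", "es"}
JAPANESE_SPEAKERS = {
    "ja-man-m02-azawa": "male",
    "ja-woman-f04-me": "female",
}


def check_request(language: str, speaker: Optional[str] = None) -> None:
    """Raise ValueError for a language or Japanese speaker not listed here."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    # Only Japanese speaker IDs are documented on this page, so only
    # those are checked; other languages pass through unvalidated.
    if language == "ja" and speaker is not None and speaker not in JAPANESE_SPEAKERS:
        raise ValueError(f"unknown Japanese speaker: {speaker!r}")


check_request("ja", "ja-woman-f04-me")  # passes silently
```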
Where to go next
- Python SDK for TTS: client.tts.synthesize(...) for one-shot, client.tts.stream(...) for streaming.
- API reference: the AsyncAPI spec for the TTS WebSocket channel.