Text to Speech | Kotoba Technologies

Kotoba’s Text-to-Speech (TTS) synthesizes audio from text and streams it back as soon as the model produces it, so you can pipe straight to a speaker, a WebRTC track, or an LLM-driven agent without waiting for the full utterance.
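To make the streaming idea concrete, here is a minimal sketch of consuming audio chunk by chunk rather than waiting for the whole utterance. It does not call the Kotoba SDK: fake_tts_stream is a hypothetical stand-in for a streaming call, and the audio format (16-bit mono PCM at 16 kHz) is an assumption for illustration, not something this page specifies.

```python
import io
import wave
from typing import Iterable, Iterator


def fake_tts_stream(text: str, chunk_bytes: int = 640) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call.

    Yields one chunk of silent 16-bit PCM per input character. A real client
    would yield audio produced by the server instead.
    """
    for _ in text:
        yield b"\x00" * chunk_bytes


def write_chunks_to_wav(chunks: Iterable[bytes], dest, sample_rate: int = 16000) -> int:
    """Write PCM chunks into a WAV container as they arrive; return PCM bytes written."""
    written = 0
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(1)           # assumed mono
        wav.setsampwidth(2)           # assumed 16-bit samples
        wav.setframerate(sample_rate)  # assumed 16 kHz
        for chunk in chunks:
            # Each chunk is handled immediately, before the next one exists.
            wav.writeframes(chunk)
            written += len(chunk)
    return written


buffer = io.BytesIO()
n_bytes = write_chunks_to_wav(fake_tts_stream("hello"), buffer)
```

The same loop shape applies to any sink: replace wav.writeframes with a write to a speaker device or a WebRTC track, and replace the stand-in generator with the real streaming call from the SDK.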

Supported languages: English (en), Japanese (ja), Korean (ko), Chinese (zh), and Spanish (es).

Three call shapes

One-shot: pass the full text in and get a complete waveform back.
Streaming: send the text in one frame and read audio chunks as the server produces them (output streaming).
Async: the same one-shot and streaming calls through AsyncKotobaClient, for workloads that run many requests concurrently.
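As a rough illustration of why the async shape matters for concurrency, the sketch below runs several synthesis requests at once with a cap on how many are in flight. It uses only asyncio: fake_synthesize is a hypothetical placeholder for an awaitable one-shot call, and synthesize_all and max_in_flight are names invented here, not part of the SDK.

```python
import asyncio
from typing import List


async def fake_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for an awaitable one-shot TTS call."""
    await asyncio.sleep(0.01)          # simulate network and model latency
    return b"\x00" * (len(text) * 2)   # placeholder audio bytes


async def synthesize_all(texts: List[str], max_in_flight: int = 4) -> List[bytes]:
    """Run requests concurrently, capped by a semaphore; results keep input order."""
    limiter = asyncio.Semaphore(max_in_flight)

    async def one(text: str) -> bytes:
        async with limiter:            # at most max_in_flight calls run at once
            return await fake_synthesize(text)

    return await asyncio.gather(*(one(t) for t in texts))


results = asyncio.run(synthesize_all(["hello", "こんにちは", "hola"]))
```

Capping concurrency with a semaphore is one common design choice for keeping a burst of requests from overwhelming a connection or a rate limit; the right cap depends on the service and is not stated on this page.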

Available Japanese speakers: ja-man-m02-azawa (male) and ja-woman-f04-me (female).

Where to go next

Python SDK for TTS — client.tts.synthesize(...) for one-shot, client.tts.stream(...) for streaming.
API reference — the AsyncAPI spec for the TTS WebSocket channel.