Text to Speech

Synthesize natural speech from text, with streaming output.

Kotoba’s Text-to-Speech (TTS) synthesizes audio from text and streams it back as soon as the model produces it, so you can pipe the audio straight to a speaker, a WebRTC track, or an LLM-driven agent without waiting for the full utterance.

Supported languages: English (en), Japanese (ja), Korean (ko), Chinese (zh), and Spanish (es).

Three call shapes

  • One-shot — pass a full string in, get a complete waveform back.
  • Streaming — send the text in one frame and read audio chunks as they emerge from the server (output streaming).
  • Async — the same surface, with AsyncKotobaClient for production-grade concurrency.
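
To make the difference between one-shot and streaming concrete, here is a minimal sketch of the consuming side of a stream: audio chunks are written to a WAV container as they arrive instead of being collected first. The exact arguments and chunk format of client.tts.stream(...) are not described on this page, so the example uses a hypothetical stand-in generator (fake_tts_stream) that yields silence. The 16-bit mono PCM format and the 24 kHz sample rate are assumptions for illustration only; check the API reference for the real output format.

```python
import io
import wave
from typing import BinaryIO, Iterable, Iterator

SAMPLE_RATE = 24_000  # assumed sample rate, not confirmed by this page


def fake_tts_stream(text: str, chunk_bytes: int = 4800) -> Iterator[bytes]:
    """Hypothetical stand-in for client.tts.stream(...).

    Yields silent 16-bit mono PCM in fixed-size chunks. The amount of
    audio per character (2400 bytes) is arbitrary.
    """
    pcm = bytes(len(text) * 2400)
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]


def write_wav(chunks: Iterable[bytes], sink: BinaryIO) -> int:
    """Write PCM chunks to a WAV sink as they arrive; return bytes written."""
    written = 0
    with wave.open(sink, "wb") as wav:
        wav.setnchannels(1)   # mono (assumed)
        wav.setsampwidth(2)   # 16-bit samples (assumed)
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            # Each chunk is handled immediately; a real application could
            # forward it to a speaker or a WebRTC track here instead.
            wav.writeframes(chunk)
            written += len(chunk)
    return written


buffer = io.BytesIO()
n_bytes = write_wav(fake_tts_stream("こんにちは"), buffer)
```

With the real SDK, the loop body would stay the same and only the source of chunks would change. A one-shot call is the simpler choice when the whole waveform is needed before anything can be done with it, for example when saving a file for later playback.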

Available Japanese speakers: ja-man-m02-azawa (male) and ja-woman-f04-me (female).

Where to go next

  • Python SDK for TTS: client.tts.synthesize(...) for one-shot, client.tts.stream(...) for streaming.
  • API reference — the AsyncAPI spec for the TTS WebSocket channel.