Python SDK — t2s | Kotoba Technologies

Source code, releases, and changelog.

Install

$ pip install kotoba-sdk

Requires Python 3.10 or later. The package is imported as kotoba.

Configure

KotobaClient() reads its credentials and per-route URLs from environment variables. For TTS you need:

Variable	Purpose
`KOTOBA_API_KEY`	Bearer token sent with WebSocket requests
`KOTOBA_TTS_JA_URL`	WebSocket URL for Japanese TTS, e.g. `wss://.../tts`
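A quick pre-flight check can make a missing variable easier to diagnose before the client is constructed. The following is a minimal sketch using only the standard library. `missing_kotoba_env` is a hypothetical helper, not part of the SDK, and it assumes the two variables in the table are the only ones TTS needs.

```python
import os

# Variable names taken from the table above.
REQUIRED_TTS_ENV = ("KOTOBA_API_KEY", "KOTOBA_TTS_JA_URL")


def missing_kotoba_env(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; not part of the kotoba package.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_TTS_ENV if not env.get(name)]


if __name__ == "__main__":
    missing = missing_kotoba_env()
    if missing:
        print("Missing environment variables:", ", ".join(missing))
    else:
        print("TTS environment looks complete.")
```

Passing a plain dict instead of relying on `os.environ` keeps the check easy to unit test.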

Or pass them in code:

import kotoba

client = kotoba.KotobaClient(
    api_key="kotoba-...",
    tts_ja_ws_url="wss://.../tts",
)

To use other voices or languages, register a route for them:

kotoba.register_endpoint("tts", None, "ko", "wss://.../tts-ko")

One-shot synthesis

import kotoba

client = kotoba.KotobaClient()
audio = client.tts.synthesize("こんにちは、世界。", language="ja")
audio.to_wav("hello.wav")

audio.to_wav() converts the underlying float32 24 kHz mono signal to a playable 16-bit WAV.
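To make the conversion concrete, here is a rough sketch of how float samples can be turned into a 16-bit WAV with the standard library alone. It is not the SDK's implementation: `float_samples_to_wav_bytes` is a hypothetical name, and the clipping and scaling choices (clip to [-1.0, 1.0], scale by 32767) are assumptions that `to_wav()` may not share.

```python
import io
import struct
import wave

SAMPLE_RATE_HZ = 24_000  # sample rate stated in the text above


def float_samples_to_wav_bytes(samples, sample_rate=SAMPLE_RATE_HZ):
    """Encode mono float samples in [-1.0, 1.0] as a 16-bit PCM WAV file.

    Illustrative only; the SDK's to_wav() may scale or clip differently.
    """
    pcm = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))                       # clip out-of-range samples
        pcm += struct.pack("<h", int(round(s * 32767)))  # little-endian int16

    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(pcm))
    return buf.getvalue()
```

The returned bytes can be written to a `.wav` file as they are, or handed to any player that accepts an in-memory WAV.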

Available Japanese speakers: ja-man-m02-azawa (male, default) and ja-woman-f04-me (female). Pass speaker_id=... to override:

audio = client.tts.synthesize(
    "こんにちは、世界。",
    language="ja",
    speaker_id="ja-woman-f04-me",
)

Streaming synthesis

The full text is sent in a single frame, and the server streams the synthesized audio back chunk by chunk. You can start playing it (or pipe it to a speaker or WebRTC track) without waiting for the utterance to finish:

with client.tts.stream(language="ja") as session:
    session.synthesize("こんにちは。本日はよろしくお願いします。")

    for event in session:
        if event.type == "audio_chunk":
            # handle() is a placeholder for your own callback.
            handle(event.audio)               # float32 PCM @ 24 kHz
        elif event.type == "done":
            break

synthesize_stream(...) flattens the loop when you only want PCM bytes:

# speaker is a placeholder for your own audio sink.
for pcm in client.tts.synthesize_stream("こんにちは、世界。", language="ja"):
    speaker.write(pcm)
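When consuming chunks this way, it is often useful to know how much audio has arrived so far, for example to size a playback buffer. The sketch below assumes each chunk is raw mono float32 (4 bytes per sample) at 24 kHz, which matches the format described above but is not confirmed for `synthesize_stream`. The helper names are hypothetical.

```python
SAMPLE_RATE_HZ = 24_000   # from the text above
BYTES_PER_SAMPLE = 4      # assumption: raw float32, mono


def pcm_duration_seconds(num_bytes,
                         sample_rate=SAMPLE_RATE_HZ,
                         bytes_per_sample=BYTES_PER_SAMPLE):
    """Seconds of audio represented by num_bytes of raw mono PCM."""
    return num_bytes / (bytes_per_sample * sample_rate)


def total_duration_seconds(chunks):
    """Sum the playback duration of an iterable of PCM byte chunks."""
    return sum(pcm_duration_seconds(len(chunk)) for chunk in chunks)
```

Under these assumptions, one second of audio is 96,000 bytes, so a 4,800-byte chunk carries 50 ms of speech.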

Async

import asyncio
import kotoba

async def main() -> None:
    async with kotoba.AsyncKotobaClient() as client:
        async with client.tts.stream(language="ja") as session:
            await session.synthesize("こんにちは。")
            async for event in session:
                if event.type == "audio_chunk":
                    # play() is a placeholder for your own coroutine.
                    await play(event.audio)
                elif event.type == "done":
                    break

asyncio.run(main())
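If playback is slower than the network, awaiting the player inside the receive loop delays reading the next event. One common pattern is to decouple the two with a bounded `asyncio.Queue`. The sketch below shows it with a stand-in event source so that it runs without the SDK. `fake_audio_events`, `receive`, `play_all`, and `run_pipeline` are hypothetical names, and the `(type, payload)` tuples only imitate the shape of the events above.

```python
import asyncio


async def fake_audio_events():
    """Stand-in for a streaming session: yields (type, payload) tuples."""
    for i in range(3):
        await asyncio.sleep(0)              # simulate waiting on the network
        yield ("audio_chunk", f"chunk-{i}".encode())
    yield ("done", None)


async def receive(events, queue):
    """Forward audio chunks into the queue; a None sentinel marks the end."""
    async for kind, payload in events:
        if kind == "audio_chunk":
            await queue.put(payload)
        elif kind == "done":
            break
    await queue.put(None)


async def play_all(queue, played):
    """Drain the queue until the sentinel; appending stands in for playback."""
    while (chunk := await queue.get()) is not None:
        played.append(chunk)


async def run_pipeline():
    queue = asyncio.Queue(maxsize=8)        # bounded, so a slow player applies backpressure
    played = []
    await asyncio.gather(
        receive(fake_audio_events(), queue),
        play_all(queue, played),
    )
    return played


if __name__ == "__main__":
    print(asyncio.run(run_pipeline()))
```

In a real program, the stand-in generator would be replaced by the session's event stream and the `append` call by the actual playback coroutine.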

What’s in the box (TTS)

Symbol	What
`client.tts.synthesize(text, language)`	One-shot synthesis → `AudioResult`
`client.tts.stream(language)`	Streaming session you drive manually
`client.tts.synthesize_stream(text, language)`	Single text → audio-chunk iterator

See the API reference for the on-the-wire protocol that this SDK wraps.