Python SDK — s2st | Kotoba Technologies

Source code, releases, and changelog.

Install

$ pip install kotoba-sdk

Requires Python 3.10 or later. The package is imported as kotoba. For live-microphone examples, install with the optional mic extra (pulls in sounddevice, which needs PortAudio on the system):

$ pip install 'kotoba-sdk[mic]'

Configure

KotobaClient() reads its credentials and per-route URLs from environment variables. For S2ST you need:

Variable	Purpose
`KOTOBA_API_KEY`	Bearer token sent on WebSocket requests
`KOTOBA_S2ST_EN_JA_URL`	WebSocket URL for English → Japanese speech translation

Or pass them in code:

import kotoba

client = kotoba.KotobaClient(
    api_key="kotoba-...",
    s2st_en_ja_ws_url="wss://.../sts",
)
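
When relying on environment variables rather than explicit arguments, a small pre-flight check can surface a missing value before the first request. The sketch below uses only the standard library and the two variable names from the table above. `missing_s2st_vars` is a hypothetical helper, not part of the SDK, and how the SDK itself reacts to a missing variable is not described on this page.

```python
import os

# Variable names taken from the configuration table; nothing else is assumed.
REQUIRED_VARS = ("KOTOBA_API_KEY", "KOTOBA_S2ST_EN_JA_URL")


def missing_s2st_vars(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; pass a dict to check something
    other than the real process environment.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_s2st_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("S2ST environment looks complete.")
```

Running the check before constructing `KotobaClient()` keeps configuration mistakes separate from connection errors.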

To use other language pairs, register them at runtime:

kotoba.register_endpoint("s2st", "en", "ko", "wss://.../sts-en-ko")

One-shot translation

translate(...) consumes a finished audio file and returns the translated audio plus the source-side transcript:

import kotoba

client = kotoba.KotobaClient()
result = client.s2st.translate("clip.mp3", src="en", tgt="ja")
result.to_wav("translated.wav")
print("source transcript:", result.transcript_source)

Streaming translation

Use client.s2st.stream(...) for live audio in and live audio out. Transcript deltas and synthesized audio chunks both surface as the server produces them:

# pcm16_chunks_from_mic() and speaker are placeholders for your own
# audio input and output; they are not defined on this page.
with client.s2st.stream(src="en", tgt="ja") as session:
    for chunk in pcm16_chunks_from_mic():
        session.send_audio(chunk)
    session.commit()

    for event in session:
        if event.type == "partial_transcript":
            print(event.text, end="", flush=True)
        elif event.type == "audio_chunk":
            speaker.write(event.audio)        # float32 PCM @ 24 kHz
        elif event.type == "done":
            break
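
The streaming example leaves `pcm16_chunks_from_mic()` undefined. For trying the loop without a microphone, one option is to slice an in-memory PCM16 buffer into fixed-duration chunks. The sketch below is one way to do that; `pcm16_chunks` is a hypothetical helper, and the 16 kHz sample rate and 20 ms chunk size are illustrative assumptions, since this page does not state the input rate or chunk size the server expects.

```python
from typing import Iterator


def pcm16_chunks(pcm: bytes, sample_rate: int = 16_000, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield fixed-duration slices of mono 16-bit PCM.

    sample_rate and chunk_ms are assumed values for illustration only.
    The final slice may be shorter than the others.
    """
    if len(pcm) % 2:
        raise ValueError("PCM16 data must contain an even number of bytes")
    # Samples per chunk, times 2 bytes per 16-bit sample.
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2
    if bytes_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]


# One second of silence at the assumed 16 kHz rate.
silence = bytes(16_000 * 2)
chunks = list(pcm16_chunks(silence))
print(len(chunks), "chunks of", len(chunks[0]), "bytes")
```

Each yielded chunk could then be passed to `session.send_audio(chunk)` in place of the microphone generator, provided the buffer matches the format the endpoint expects.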

Tuning latency with `delay`

Both stream(...) and translate(...) accept an optional delay parameter: an integer from 0 to 20 that controls how many tokens of context the server buffers before emitting translated audio. Higher values give the model more lookahead, which improves translation quality at the cost of latency; lower values reduce latency. Omit it to keep the server default.

with client.s2st.stream(src="en", tgt="ja", delay=10) as session:
    ...

result = client.s2st.translate("clip.mp3", src="en", tgt="ja", delay=10)
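
If delay comes from user input or a config file, it may be worth checking the documented 0–20 range before opening a session. The following is a minimal sketch; `check_delay` is a hypothetical helper, and whether the SDK performs its own client-side validation is not stated here.

```python
from typing import Optional

DELAY_MIN, DELAY_MAX = 0, 20  # range documented for the delay parameter


def check_delay(delay: Optional[int]) -> Optional[int]:
    """Return delay unchanged if it is None or an int in 0..20, else raise.

    Hypothetical helper; None means "omit the parameter and keep the
    server default".
    """
    if delay is None:
        return None
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(delay, bool) or not isinstance(delay, int):
        raise TypeError("delay must be an integer")
    if not DELAY_MIN <= delay <= DELAY_MAX:
        raise ValueError(
            f"delay must be between {DELAY_MIN} and {DELAY_MAX}, got {delay}"
        )
    return delay


print(check_delay(10))
```

A validated value can then be forwarded as `delay=check_delay(value)` only when it is not None, so that an unset value still falls back to the server default.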

Async

Both entry points have async equivalents via AsyncKotobaClient:

import asyncio
import kotoba

async def main() -> None:
    async with kotoba.AsyncKotobaClient() as client:
        result = await client.s2st.translate("clip.mp3", src="en", tgt="ja")
        result.to_wav("translated.wav")

asyncio.run(main())

For a runnable mic demo, see examples/s2st_mic_async.py in the SDK repo. It needs the mic extra (which installs sounddevice) and PortAudio.

What’s in the box (S2ST)

Symbol	Description
`client.s2st.translate(path, src, tgt)`	One-shot file → translated WAV + transcript
`client.s2st.stream(src, tgt)`	Bi-directional WebSocket session

See the API reference for the on-the-wire protocol that this SDK wraps.