Python SDK — s2st

Speech-to-speech translation (S2ST) from Python over a streaming WebSocket connection.

Install

$ pip install kotoba-sdk

Requires Python 3.10 or later. The package is imported as kotoba. For live-microphone examples, install with the optional mic extra (pulls in sounddevice, which needs PortAudio on the system):

$ pip install 'kotoba-sdk[mic]'

Configure

KotobaClient() reads its credentials and per-route URLs from environment variables. For S2ST you need:

Variable                 Purpose
KOTOBA_API_KEY           Bearer token sent on WS requests
KOTOBA_S2ST_EN_JA_URL    WebSocket URL for English → Japanese speech translation

Or pass them in code:

import kotoba

client = kotoba.KotobaClient(
    api_key="kotoba-...",
    s2st_en_ja_ws_url="wss://.../sts",
)

To use other language pairs, register them at runtime:

kotoba.register_endpoint("s2st", "en", "ko", "wss://.../sts-en-ko")

One-shot translation

translate(...) consumes a finished audio file and returns the translated audio plus the source-side transcript:

import kotoba

client = kotoba.KotobaClient()
result = client.s2st.translate("clip.mp3", src="en", tgt="ja")
result.to_wav("translated.wav")
print("source transcript:", result.transcript_source)
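To sanity-check the file that to_wav(...) wrote, the standard-library wave module can read its header. This is a sketch under one assumption that the text here does not confirm: that the output is integer PCM. The wave module raises wave.Error on IEEE-float WAV files, so treat a failure as a hint about the encoding rather than a bug. The helper name wav_summary is hypothetical.

```python
import wave


def wav_summary(path: str) -> dict[str, float]:
    """Read the header of an integer-PCM WAV file and summarize it."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "sample_rate_hz": rate,
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            # Guard against a zero rate in a malformed header.
            "duration_s": frames / rate if rate else 0.0,
        }
```

For example, wav_summary("translated.wav") after the snippet above would report the sample rate and duration of the translated clip.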

Streaming translation

Use client.s2st.stream(...) for live audio in / live audio out — both transcript deltas and synthesized chunks surface as the server produces them:

# pcm16_chunks_from_mic() and speaker are not defined in this snippet;
# they stand in for your own audio input and output.
with client.s2st.stream(src="en", tgt="ja") as session:
    for chunk in pcm16_chunks_from_mic():
        session.send_audio(chunk)
    session.commit()

    for event in session:
        if event.type == "partial_transcript":
            print(event.text, end="", flush=True)
        elif event.type == "audio_chunk":
            speaker.write(event.audio)  # float32 PCM @ 24 kHz
        elif event.type == "done":
            break
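The loop above leaves audio capture and playback to the caller. As a rough sketch of the plumbing on either side, the two helpers below split a 16-bit PCM buffer into fixed-duration chunks and convert float32 samples to 16-bit PCM. Several things here are assumptions, not SDK facts: the 16 kHz mono input rate, the 20 ms chunk size, and that event.audio arrives as raw little-endian float32 bytes. Both helper names are hypothetical.

```python
import struct
from collections.abc import Iterator


def pcm16_chunks(pcm: bytes, sample_rate: int = 16_000, chunk_ms: int = 20) -> Iterator[bytes]:
    """Yield fixed-duration slices of mono 16-bit PCM (2 bytes per sample).

    The default rate and chunk size are illustrative assumptions.
    """
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2
    if bytes_per_chunk <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]  # last slice may be shorter


def float32_to_pcm16(audio: bytes) -> bytes:
    """Convert little-endian float32 samples in [-1, 1] to 16-bit PCM.

    Assumes the input is raw bytes; trailing bytes that do not form a
    whole 4-byte sample are ignored.
    """
    count = len(audio) // 4
    samples = struct.unpack(f"<{count}f", audio[:count * 4])
    # Clip before scaling so out-of-range samples cannot overflow int16.
    clipped = (max(-1.0, min(1.0, s)) for s in samples)
    return struct.pack(f"<{count}h", *(int(round(s * 32767)) for s in clipped))
```

A generator like pcm16_chunks could feed session.send_audio(...) from a prerecorded buffer, and float32_to_pcm16 is useful when the playback device or a WAV writer expects integer samples.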

Tuning latency with delay

Both stream(...) and translate(...) accept an optional delay parameter: an integer in the range 0 to 20 that controls how many tokens of context the server buffers before emitting translated audio. Higher values give the model more lookahead, which improves translation quality; lower values reduce latency. Omit it to keep the server default.

with client.s2st.stream(src="en", tgt="ja", delay=10) as session:
    ...

result = client.s2st.translate("clip.mp3", src="en", tgt="ja", delay=10)

Async

Both entry points have async equivalents via AsyncKotobaClient:

import asyncio
import kotoba

async def main() -> None:
    async with kotoba.AsyncKotobaClient() as client:
        result = await client.s2st.translate("clip.mp3", src="en", tgt="ja")
        result.to_wav("translated.wav")

asyncio.run(main())

For a runnable mic demo, see examples/s2st_mic_async.py in the SDK repo. It needs the mic extra (sounddevice) and PortAudio.
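One reason to reach for the async client is translating several files concurrently. The helper below is a generic asyncio sketch, not an SDK feature: it caps the number of in-flight coroutines with a semaphore and preserves input order. Whether a single AsyncKotobaClient supports concurrent translate(...) calls, and what limit is sensible, are assumptions to verify against the service.

```python
import asyncio
from collections.abc import Awaitable, Callable, Iterable
from typing import TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def map_bounded(
    func: Callable[[T], Awaitable[R]],
    items: Iterable[T],
    limit: int = 4,
) -> list[R]:
    """Await func(item) for every item, at most `limit` at a time, keeping order."""
    semaphore = asyncio.Semaphore(limit)

    async def run(item: T) -> R:
        async with semaphore:
            return await func(item)

    return await asyncio.gather(*(run(item) for item in items))


# Hypothetical usage inside `async with kotoba.AsyncKotobaClient() as client:`
#     results = await map_bounded(
#         lambda path: client.s2st.translate(path, src="en", tgt="ja"),
#         ["a.mp3", "b.mp3"],
#         limit=2,
#     )
```

asyncio.gather returns results in the order the awaitables were passed, so results[i] corresponds to the i-th input path even when calls finish out of order.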

What’s in the box (S2ST)

Symbol                                   What
client.s2st.translate(path, src, tgt)    One-shot file → translated WAV + transcript
client.s2st.stream(src, tgt)             Bi-directional WebSocket session

See the API reference for the on-the-wire protocol that this SDK wraps.