Speech to Speech Translation

Translate spoken audio into spoken audio in another language.

Kotoba’s Speech-to-Speech translation (S2ST) ingests audio in one language and emits both an incremental transcript of the source and synthesized speech in the target language, all over a single WebSocket connection. Use it for simultaneous translation, live captioning with voice-over, and any scenario that needs sub-second turnaround without splitting the pipeline into separate ASR, MT, and TTS steps.

Supported languages: English (en), Japanese (ja), Korean (ko), Chinese (zh), and Spanish (es).

What you get back

  • partial_transcript — incremental source-language transcript.
  • audio_chunk — synthesized target-language audio, streamed as it is produced.
  • done — emitted when the server has finished processing the committed audio.
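
The three message types above can be consumed with a small dispatcher. The following is a minimal sketch, not the SDK's implementation. It assumes each server message is a JSON object with a "type" field, that partial_transcript carries the full transcript so far in a "text" field, and that audio_chunk carries base64-encoded bytes in an "audio" field. Those field names and the S2STSession and handle_message helpers are hypothetical; check the API reference for the actual wire format.

```python
import base64
import json
from dataclasses import dataclass, field


@dataclass
class S2STSession:
    """Accumulates the results of one translation session (hypothetical helper)."""
    transcript: str = ""
    audio: bytearray = field(default_factory=bytearray)
    finished: bool = False


def handle_message(session: S2STSession, raw: str) -> None:
    """Apply one server message to the session state.

    ASSUMPTION: the JSON field names "type", "text", and "audio" are
    illustrative and are not taken from the S2ST specification.
    """
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "partial_transcript":
        # Assumed: each partial holds the whole source transcript so far,
        # so the latest message replaces the previous one.
        session.transcript = msg["text"]
    elif kind == "audio_chunk":
        # Assumed: audio arrives base64-encoded; append it in arrival order.
        session.audio.extend(base64.b64decode(msg["audio"]))
    elif kind == "done":
        # The server has finished processing the committed audio.
        session.finished = True
    # Any other message type is ignored in this sketch.


if __name__ == "__main__":
    # Simulated server messages standing in for a live WebSocket stream.
    simulated = [
        json.dumps({"type": "partial_transcript", "text": "konnichiwa"}),
        json.dumps({"type": "audio_chunk",
                    "audio": base64.b64encode(b"\x00\x01").decode("ascii")}),
        json.dumps({"type": "done"}),
    ]
    session = S2STSession()
    for raw in simulated:
        handle_message(session, raw)
    print(session.transcript, len(session.audio), session.finished)
```

In a live client, the same handle_message call would run once per message received on the WebSocket, and the loop would exit once session.finished becomes true.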

Where to go next

  • Python SDK for S2ST — client.s2st.translate(...) for a one-shot file, client.s2st.stream(...) for live audio.
  • API reference — the AsyncAPI spec for the S2ST WebSocket channel.