Speech to Speech Translation
Translate spoken audio into spoken audio in another language.
Kotoba’s Speech-to-Speech Translation (S2ST) takes audio in one language and returns both an incremental transcript of the source and synthesized speech in the target language, all over a single WebSocket connection. Use it for simultaneous translation, live captioning with voice-over, and any scenario that needs sub-second turnaround without splitting the pipeline into separate ASR, MT, and TTS steps.
Supported languages: English (en), Japanese (ja), Korean (ko), Chinese (zh), and Spanish (es).
What you get back
- `partial_transcript`: incremental source-language transcript.
- `audio_chunk`: synthesized target-language audio, streamed as it is produced.
- `done`: emitted when the server has finished processing the committed audio.
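As a rough illustration of how a client might consume these three events, here is a minimal sketch. The wire format is an assumption, not taken from the API reference: it supposes each message is a JSON text frame with a `type` field, that `partial_transcript` carries a `text` field, and that `audio_chunk` carries base64-encoded bytes in an `audio` field. The names `S2STResult` and `collect_events` are hypothetical helpers, not part of the SDK. Check the AsyncAPI spec for the real message shapes.

```python
import base64
import json
from dataclasses import dataclass, field


@dataclass
class S2STResult:
    """Accumulated output of one S2ST session (hypothetical helper type)."""
    transcript_updates: list = field(default_factory=list)
    audio: bytearray = field(default_factory=bytearray)
    finished: bool = False


def collect_events(raw_messages):
    """Fold an iterable of JSON text frames into an S2STResult.

    ASSUMPTION: every frame is a JSON object with a "type" key, and the
    payload field names ("text", "audio") are illustrative only.
    """
    result = S2STResult()
    for raw in raw_messages:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "partial_transcript":
            # Incremental source-language transcript update.
            result.transcript_updates.append(msg.get("text", ""))
        elif kind == "audio_chunk":
            # Synthesized target-language audio, appended in arrival order.
            result.audio.extend(base64.b64decode(msg.get("audio", "")))
        elif kind == "done":
            # Server has finished processing the committed audio.
            result.finished = True
            break
        # Any other event type is ignored in this sketch.
    return result
```

In a live client the same loop would read frames from the WebSocket instead of a list, and would typically hand each audio chunk to a player as it arrives rather than buffering the whole stream.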
Where to go next
- Python SDK for S2ST: `client.s2st.translate(...)` for a one-shot file, `client.s2st.stream(...)` for live audio.
- API reference: the AsyncAPI spec for the S2ST WebSocket channel.