Speech to Text

Stream live transcripts or transcribe finished audio files.

Kotoba’s Speech-to-Text (ASR) lets you turn audio into text in either of two complementary modes:

Live (WebSocket): push PCM16 chunks as they arrive and read transcript deltas back over the same connection. Built for microphones and any latency-sensitive pipeline where the first words matter before the utterance ends.
Batch (REST): POST an audio file and poll until the job completes. Best for long files, offline workflows, and anything that doesn’t need partial results.
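
To make "PCM16 chunks" concrete, here is a minimal sketch of slicing raw audio into fixed-duration frames before they are pushed over a live connection. PCM16 means each sample is a signed 16-bit integer (2 bytes). The 16 kHz mono format, the 100 ms chunk size, and the helper names `pcm16_chunks` and `sine_pcm16` are assumptions made for illustration, not values taken from the Kotoba API; check the Live (WebSocket) reference for the format the service actually expects.

```python
import math
import struct
from typing import Iterator


def pcm16_chunks(pcm: bytes, sample_rate: int = 16_000, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield fixed-duration slices of mono PCM16 audio (hypothetical helper).

    sample_rate and chunk_ms are illustrative defaults, not documented values.
    """
    bytes_per_sample = 2  # PCM16: one signed 16-bit integer per sample
    chunk_bytes = (sample_rate * chunk_ms // 1000) * bytes_per_sample
    if chunk_bytes <= 0:
        raise ValueError("chunk_ms is too small for this sample rate")
    for start in range(0, len(pcm), chunk_bytes):
        # The final slice may be shorter than chunk_bytes.
        yield pcm[start:start + chunk_bytes]


def sine_pcm16(seconds: float, sample_rate: int = 16_000, freq: float = 440.0) -> bytes:
    """Generate a quiet test tone as little-endian PCM16 bytes (stand-in for a microphone)."""
    n = int(seconds * sample_rate)
    values = [
        int(32767 * 0.3 * math.sin(2 * math.pi * freq * i / sample_rate))
        for i in range(n)
    ]
    return struct.pack(f"<{n}h", *values)


audio = sine_pcm16(0.25)            # 4000 samples, 8000 bytes
chunks = list(pcm16_chunks(audio))  # two full 100 ms chunks plus one 50 ms remainder
```

In a real client, each chunk would be sent over the streaming connection as soon as it is captured, rather than collected into a list first.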

Both modes support English (en), Japanese (ja), Korean (ko), and Chinese (zh) input.

Pick a transport

You want…	Use
Mic in, text out, while audio is flowing	Live (WebSocket)
File in, full transcript out	Batch (REST)
Per-segment timestamps	Batch (REST) with `with_timestamps`
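
For the batch route in the table above, the "submit, then poll until the job completes" flow can be sketched as a generic polling loop. This is a sketch under stated assumptions: the `wait_for_job` helper, the `status` field, and the `"completed"` and `"failed"` values are hypothetical placeholders, and the HTTP call is injected as a callable so that no endpoint or response shape is implied. The Batch (REST) reference defines the real job states.

```python
import time
from typing import Any, Callable, Mapping


def wait_for_job(
    fetch_status: Callable[[], Mapping[str, Any]],
    *,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Mapping[str, Any]:
    """Poll fetch_status() until the job finishes (hypothetical helper).

    The "status" key and its values are assumptions, not the documented schema.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError(f"transcription job failed: {job!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError("transcription job did not finish in time")
        sleep(interval_s)


# Stand-in for real HTTP calls: a canned sequence of job states.
_responses = iter([
    {"status": "processing"},
    {"status": "processing"},
    {"status": "completed", "text": "hello world"},
])
result = wait_for_job(lambda: next(_responses), sleep=lambda _seconds: None)
```

Injecting `fetch_status` and `sleep` keeps the loop testable without network access; in practice `fetch_status` would wrap a GET against the job resource returned by the initial POST.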

Where to go next

Python SDK for ASR — client.asr.transcribe(...) for REST batch, client.asr.transcribe_stream(...) for live.
API reference — Live (WebSocket) — the AsyncAPI spec for the streaming channel.
API reference — Batch (REST) — the OpenAPI spec for the async transcription job endpoints.