Speech to Text

Stream live transcripts or transcribe finished audio files.

Kotoba’s Speech-to-Text (ASR) lets you turn audio into text in either of two complementary modes:

  • Live (WebSocket): push PCM16 chunks as they arrive and read transcript deltas back over the same connection. Built for microphones and any latency-sensitive pipeline where the first words matter before the utterance ends.
  • Batch (REST): POST an audio file and poll until the job completes. Best for long files, offline workflows, and anything that doesn’t need partial results.
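
To make "PCM16 chunks" concrete, here is a minimal Python sketch of preparing audio for the Live mode. It only covers the client-side framing: converting float samples to 16-bit little-endian PCM and slicing the result into fixed-duration chunks. The helper names, the 16 kHz mono sample rate, and the 100 ms chunk size are illustrative assumptions, not values taken from the Kotoba API; check the Live reference for the format the service actually expects.

```python
import struct
from typing import Iterator, Sequence


def float_to_pcm16(samples: Sequence[float]) -> bytes:
    """Convert floats in [-1.0, 1.0] to signed 16-bit little-endian PCM bytes."""
    # Clip first so out-of-range input cannot overflow the 16-bit range.
    ints = [int(round(max(-1.0, min(1.0, s)) * 32767)) for s in samples]
    return struct.pack("<%dh" % len(ints), *ints)


def pcm16_chunks(pcm: bytes, sample_rate: int = 16000, chunk_ms: int = 100) -> Iterator[bytes]:
    """Yield consecutive chunks of mono PCM16 audio, each about chunk_ms long.

    sample_rate and chunk_ms are assumed defaults for illustration only.
    The final chunk may be shorter than the others.
    """
    bytes_per_chunk = (sample_rate * chunk_ms // 1000) * 2  # 2 bytes per sample
    for start in range(0, len(pcm), bytes_per_chunk):
        yield pcm[start:start + bytes_per_chunk]
```

In a live pipeline, each chunk yielded by `pcm16_chunks` would be sent as one binary WebSocket message while transcript deltas are read from the same socket.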

Both modes support English (en), Japanese (ja), Korean (ko), and Chinese (zh) input.

Pick a transport

You want…                                   Use
Mic in, text out, while audio is flowing    Live (WebSocket)
File in, full transcript out                Batch (REST)
Per-segment timestamps                      Batch (REST) with with_timestamps
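
For the Batch (REST) rows, the "poll until the job completes" step could be sketched in Python as below. The sketch deliberately takes the HTTP call as a caller-supplied function, because this page does not list endpoints. The `status` field and its `"completed"` and `"failed"` values are hypothetical placeholders; substitute whatever the Batch reference actually returns.

```python
import time
from typing import Any, Callable, Dict


def wait_for_job(
    fetch_job: Callable[[], Dict[str, Any]],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_job repeatedly until the job finishes, fails, or times out.

    fetch_job is supplied by the caller and should return the job's current
    state as a dict. The "status" key and its values are assumptions made
    for this sketch, not documented Kotoba fields.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job()
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":
            raise RuntimeError("transcription job failed: %r" % (job,))
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not complete within %s seconds" % timeout_s)
        sleep(interval_s)
```

A caller would wrap its own GET request for the job in `fetch_job` and read the transcript from the returned dict once `wait_for_job` returns.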

Where to go next