DeepL API for speech to text is now generally available via the v3 API endpoint for customers with a DeepL API Pro subscription. The supported scope of the speech to text functionality is covered on this documentation page. Please note that the existing provisions applying to customers' DeepL API Pro subscriptions also apply to DeepL API for speech to text, with applicable additions to the Terms and Conditions, the Service Specification, and the Data Processing Agreement (a new sub-processor has been added to serve specific languages for the API for speech to text).
Overview
The Voice API provides a way to open WebSocket streaming connections to transcribe and translate audio data. With each streaming connection, you can:

- Send a single audio stream
- Receive transcriptions in the source language
- Receive translations in multiple target languages
- Request a streaming URL via POST request
- Stream audio via WebSocket
Getting Started
To start using the Voice API:

- Ensure you have a DeepL API Pro account with Voice API access
- Review the Request Stream documentation
- Review the WebSocket Streaming documentation
- Choose your audio format and configuration
- Implement the two-step flow in your application
Supported Languages
All source languages can be translated into any target language.
Source languages
Chinese
Czech
Dutch
English
French
German
Indonesian
Italian
Japanese
Korean
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish
Ukrainian
Target languages
Arabic
Bulgarian
Chinese (Simplified)
Chinese (Traditional)
Czech
Danish
Dutch
English (American)
English (British)
Estonian
Finnish
French
German
Greek
Hebrew
Hungarian
Indonesian
Italian
Japanese
Korean
Latvian
Lithuanian
Norwegian Bokmål
Polish
Portuguese (Brazil)
Portuguese (Portugal)
Romanian
Russian
Slovak
Slovenian
Spanish
Swedish
Turkish
Ukrainian
Vietnamese
Supported Audio Formats
The API supports various common combinations of streaming codecs and containers with a single-channel (mono) audio stream. For a detailed list, please refer to Source Media Content Type.

| Audio Codec | Audio Container | Recommended Bitrate |
|---|---|---|
| PCM | - | 256 kbps (16kHz), default recommendation |
| OPUS | Matroska / Ogg / WebM | 32 kbps, recommended for low bandwidth scenarios |
| AAC | Matroska | 96 kbps |
| FLAC | FLAC / Matroska / Ogg | 256 kbps (16kHz) |
| MP3 | Matroska / MPEG | 128 kbps |
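The 256 kbps recommendation for PCM follows from the arithmetic of uncompressed mono audio at a 16 kHz sample rate, assuming the common 16-bit sample depth (the sample depth is an inference, not stated in the table):

```python
# Bitrate of raw PCM: sample rate x bits per sample x channels.
sample_rate_hz = 16_000    # 16 kHz, as recommended above
bits_per_sample = 16       # assumed 16-bit PCM
channels = 1               # the API accepts mono streams only

bitrate_bps = sample_rate_hz * bits_per_sample * channels
print(bitrate_bps)         # 256000 bits/s, i.e. 256 kbps
```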
Two-Step API Flow
The Voice API uses a two-step flow to initiate streaming.

1. Request Stream

Make a POST request to v3/voice/realtime to obtain an ephemeral streaming URL and authentication token. This step handles:

- Authentication and authorization
- Main configuration options (audio format, languages, glossaries, etc.)

The returned URL and token are valid for one-time use only. See the Request Stream documentation for details.
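The request-stream step can be sketched as follows. Only the endpoint path and the standard DeepL-Auth-Key authorization header are taken from the DeepL API; the configuration body is passed through opaquely because its exact field names are defined in the Request Stream documentation, not here:

```python
import json
import urllib.request

STREAM_ENDPOINT = "https://api.deepl.com/v3/voice/realtime"

def build_stream_request(auth_key: str, config: dict) -> urllib.request.Request:
    """Build the POST request that obtains a one-time streaming URL and token.

    `config` carries the main options (audio format, languages, glossaries,
    ...); consult the Request Stream docs for the actual schema.
    """
    return urllib.request.Request(
        STREAM_ENDPOINT,
        data=json.dumps(config).encode("utf-8"),
        headers={
            "Authorization": f"DeepL-Auth-Key {auth_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it would look like:
#   with urllib.request.urlopen(build_stream_request(key, config)) as resp:
#       stream_info = json.load(resp)   # contains the one-time URL/token
```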
2. Streaming Audio and Text (WebSocket)

Use the received URL to establish a WebSocket connection to wss://api.deepl.com/v3/voice/realtime/connect?token=<secure access token>. This step handles exchanging JSON messages on the WebSocket connection:

- Sending audio data
- Receiving transcriptions and translations in real time

See the WebSocket Streaming documentation for details.

Once a WebSocket connection is established, you must send audio data within 30 seconds to prevent the connection from being closed.
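The streaming step might be sketched as below, using the third-party `websockets` package (an assumption; any WebSocket client works). The exact on-the-wire encoding of audio frames and result messages is defined in the WebSocket Streaming documentation, so the send/receive bodies are placeholders:

```python
import asyncio

def connect_url(token: str) -> str:
    """Build the connect URL from the secure access token.

    In practice the request-stream step may already return the full URL;
    this helper just mirrors the documented URL pattern.
    """
    return f"wss://api.deepl.com/v3/voice/realtime/connect?token={token}"

async def stream(stream_url: str, audio_chunks) -> None:
    """Send audio chunks over the one-time WebSocket URL and print results."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(stream_url) as ws:

        async def send_audio() -> None:
            for chunk in audio_chunks:
                await ws.send(chunk)      # must start within 30 s of connecting
                await asyncio.sleep(0.1)  # pace chunks at roughly real time

        async def receive_results() -> None:
            async for message in ws:      # transcription/translation messages
                print(message)

        # Sending and receiving run in parallel on the same connection.
        await asyncio.gather(send_audio(), receive_results())
```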
The following sequence diagram shows the flow of messages; par means parallel execution and loop means looped execution.

Limitations and Constraints
- Maximum 5 target languages per stream
- Maximum streaming connection duration: 3 hours
- Audio chunk size: must not exceed 100 kilobytes or 1 second in duration
- Recommended chunk duration: 50-250 milliseconds for low latency
- Audio stream speed: maximum 2x real-time
- Timeout: If no data is received for 30 seconds, the session will be terminated
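The chunk-size constraints above can be enforced with a small splitter. For raw PCM the byte budget of a given duration follows from the sample rate and sample width (a sketch, assuming 16-bit mono PCM at 16 kHz; the helper name is illustrative):

```python
MAX_CHUNK_BYTES = 100 * 1024   # hard limit: 100 KB per chunk
TARGET_CHUNK_MS = 100          # within the recommended 50-250 ms window

def pcm_chunks(pcm: bytes, sample_rate_hz: int = 16_000,
               bytes_per_sample: int = 2, chunk_ms: int = TARGET_CHUNK_MS):
    """Yield mono PCM chunks of roughly chunk_ms each, capped at 100 KB."""
    bytes_per_ms = sample_rate_hz * bytes_per_sample // 1000
    size = min(chunk_ms * bytes_per_ms, MAX_CHUNK_BYTES)
    size -= size % bytes_per_sample   # keep chunks aligned to whole samples
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

At 16 kHz with 16-bit samples, a 100 ms chunk is 3,200 bytes, comfortably under both the 100 KB and 1 second limits.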