DeepL API for speech to text is now generally available via the v3 API endpoint for customers with a DeepL API Pro subscription. The supported scope of the speech to text functionality is covered on this documentation page. Please note that the existing provisions applying to customers' DeepL API Pro subscriptions also apply to DeepL API for speech to text, with applicable additions to the Terms and Conditions, the Service Specification, and the Data Processing Agreement (a new sub-processor has been added to serve specific languages for the API for speech to text).
Overview
The Voice API provides a way to open WebSocket streaming connections to transcribe and translate audio data. With each streaming connection, you can:

- Send a single audio stream
- Receive transcriptions in the source language
- Receive translations in multiple target languages
- Request a streaming URL via POST request
- Stream audio via WebSocket
Getting Started
To start using the Voice API:

- Ensure you have a DeepL API Pro account with Voice API access
- Review the Request Stream documentation
- Review the WebSocket Streaming documentation
- Choose your audio format and configuration
- Implement the two-step flow in your application
Supported Languages
All source languages can be translated into any target language.
Source languages
Chinese
Czech
Dutch
English
French
German
Indonesian
Italian
Japanese
Korean
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish
Ukrainian
Target languages
Arabic
Bulgarian
Chinese (Simplified)
Chinese (Traditional)
Czech
Danish
Dutch
English (American)
English (British)
Estonian
Finnish
French
German
Greek
Hebrew
Hungarian
Indonesian
Italian
Japanese
Korean
Latvian
Lithuanian
Norwegian Bokmål
Polish
Portuguese (Brazil)
Portuguese (Portugal)
Romanian
Russian
Slovak
Slovenian
Spanish
Swedish
Turkish
Ukrainian
Vietnamese
Supported Audio Formats
The API supports various common combinations of streaming codecs and containers with a single-channel (mono) audio stream. For a detailed list, please refer to Source Media Content Type.

| Audio Codec | Audio Container | Recommended Bitrate |
|---|---|---|
| PCM | - | 256 kbps (16kHz), default recommendation |
| OPUS | Matroska / Ogg / WebM | 32 kbps, recommended for low bandwidth scenarios |
| AAC | Matroska | 96 kbps |
| FLAC | FLAC / Matroska / Ogg | 256 kbps (16kHz) |
| MP3 | Matroska / MPEG | 128 kbps |
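The 256 kbps recommendation for PCM follows from the arithmetic of uncompressed mono audio at a 16 kHz sample rate, assuming the common 16-bit sample depth (the sample depth is an inference, not stated in the table):

```python
# Bitrate of raw PCM: sample rate x bits per sample x channels.
sample_rate_hz = 16_000    # 16 kHz, as recommended above
bits_per_sample = 16       # assumed 16-bit PCM
channels = 1               # the API accepts mono streams only

bitrate_bps = sample_rate_hz * bits_per_sample * channels
print(bitrate_bps)         # 256000 bits/s, i.e. 256 kbps
```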
Two-Step API Flow
The Voice API uses a two-step flow to initiate streaming.

1. Request Stream

Make a POST request to v3/voice/realtime to obtain an ephemeral streaming URL and authentication token. This step handles:

- Authentication and authorization
- Main configuration options (audio format, languages, glossaries, etc.)

The returned URL and token are valid for one-time use only. See the Request Stream documentation for details.
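The request-stream step can be sketched as follows. Only the endpoint path and the standard DeepL-Auth-Key authorization header are taken from the DeepL API; the configuration body is passed through opaquely because its exact field names are defined in the Request Stream documentation, not here:

```python
import json
import urllib.request

STREAM_ENDPOINT = "https://api.deepl.com/v3/voice/realtime"

def build_stream_request(auth_key: str, config: dict) -> urllib.request.Request:
    """Build the POST request that obtains a one-time streaming URL and token.

    `config` carries the main options (audio format, languages, glossaries,
    ...); consult the Request Stream docs for the actual schema.
    """
    return urllib.request.Request(
        STREAM_ENDPOINT,
        data=json.dumps(config).encode("utf-8"),
        headers={
            "Authorization": f"DeepL-Auth-Key {auth_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Sending it would look like:
#   with urllib.request.urlopen(build_stream_request(key, config)) as resp:
#       stream_info = json.load(resp)   # contains the one-time URL/token
```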
2. Streaming Audio and Text (WebSocket)

Use the received URL to establish a WebSocket connection to wss://api.deepl.com/v3/voice/realtime/connect?token=<secure access token>. This step handles exchanging JSON messages on the WebSocket connection:

- Sending audio data
- Receiving transcriptions and translations in real time

See the WebSocket Streaming documentation for details.

Once a WebSocket connection is established, you must send audio data within 30 seconds to prevent the connection from being closed.
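The streaming step might be sketched as below, using the third-party `websockets` package (an assumption; any WebSocket client works). The exact on-the-wire encoding of audio frames and result messages is defined in the WebSocket Streaming documentation, so the send/receive bodies are placeholders:

```python
import asyncio

def connect_url(token: str) -> str:
    """Build the connect URL from the secure access token.

    In practice the request-stream step may already return the full URL;
    this helper just mirrors the documented URL pattern.
    """
    return f"wss://api.deepl.com/v3/voice/realtime/connect?token={token}"

async def stream(stream_url: str, audio_chunks) -> None:
    """Send audio chunks over the one-time WebSocket URL and print results."""
    import websockets  # third-party: pip install websockets

    async with websockets.connect(stream_url) as ws:

        async def send_audio() -> None:
            for chunk in audio_chunks:
                await ws.send(chunk)      # must start within 30 s of connecting
                await asyncio.sleep(0.1)  # pace chunks at roughly real time

        async def receive_results() -> None:
            async for message in ws:      # transcription/translation messages
                print(message)

        # Sending and receiving run in parallel on the same connection.
        await asyncio.gather(send_audio(), receive_results())
```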
The following sequence diagram shows the flow of messages; par means parallel execution and loop means looped execution.

Limitations and Constraints
- Maximum 5 target languages per stream
- Maximum streaming connection duration: 3 hours
- Audio chunk size: must not exceed 100 kilobytes or 1 second in duration
- Recommended chunk duration: 50-250 milliseconds for low latency
- Audio stream speed: maximum 2x real-time
- Timeout: If no data is received for 30 seconds, the session will be terminated
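The chunk-size constraints above can be enforced with a small splitter. For raw PCM the byte budget of a given duration follows from the sample rate and sample width (a sketch, assuming 16-bit mono PCM at 16 kHz; the helper name is illustrative):

```python
MAX_CHUNK_BYTES = 100 * 1024   # hard limit: 100 KB per chunk
TARGET_CHUNK_MS = 100          # within the recommended 50-250 ms window

def pcm_chunks(pcm: bytes, sample_rate_hz: int = 16_000,
               bytes_per_sample: int = 2, chunk_ms: int = TARGET_CHUNK_MS):
    """Yield mono PCM chunks of roughly chunk_ms each, capped at 100 KB."""
    bytes_per_ms = sample_rate_hz * bytes_per_sample // 1000
    size = min(chunk_ms * bytes_per_ms, MAX_CHUNK_BYTES)
    size -= size % bytes_per_sample   # keep chunks aligned to whole samples
    for offset in range(0, len(pcm), size):
        yield pcm[offset:offset + size]
```

At 16 kHz with 16-bit samples, a 100 ms chunk is 3,200 bytes, comfortably under both the 100 KB and 1 second limits.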