POST /v1/audio/transcriptions
Convert an audio file to text in its original language. Compatible with OpenAI Whisper.
Request body
This endpoint uses multipart/form-data encoding.
The transcription model to use. For example, whisper-1. The available values depend on your configured channels.
The audio file to transcribe. Supported formats include mp3, mp4, mpeg, mpga, m4a, ogg, wav, and webm. Maximum file size is 25 MB.
The language of the audio, as an ISO-639-1 code (e.g., en, zh, fr). Providing this hint improves accuracy and speed. If omitted, the model detects the language automatically.
The format of the transcription output. One of json, text, srt, verbose_json, or vtt. Defaults to json.
Optional text to guide the model’s transcription style or vocabulary.
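The limits listed for the request body can be checked on the client before uploading. The following is a minimal sketch: `check_upload` is a hypothetical helper, "25 MB" is assumed to mean 25 * 1024 * 1024 bytes, and because the format list is introduced with "include", the server may accept extensions that this check rejects.

```python
from pathlib import Path

# Values taken from the request body description.
SUPPORTED_EXTENSIONS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "ogg", "wav", "webm"}
RESPONSE_FORMATS = {"json", "text", "srt", "verbose_json", "vtt"}
# Assumption: "25 MB" is read as 25 * 1024 * 1024 bytes.
MAX_FILE_BYTES = 25 * 1024 * 1024


def check_upload(filename: str, size_bytes: int, response_format: str = "json") -> list[str]:
    """Return a list of problems; an empty list means the upload looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lstrip(".").lower()
    if extension not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file extension: {extension or '(none)'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file is {size_bytes} bytes; the limit is {MAX_FILE_BYTES}")
    if response_format not in RESPONSE_FORMATS:
        problems.append(f"unknown response_format: {response_format}")
    return problems
```

Calling `check_upload("meeting.mp3", 1024)` returns an empty list, while an oversized file or an unknown output format adds one message per problem.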
Response
For response_format: json, the response is:
The transcribed text.
For response_format: verbose_json, additional fields are returned:
Always "transcribe".
The detected or provided language.
Duration of the audio in seconds.
Time-aligned segments of the transcription.
Example
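A transcription request could be assembled with the Python standard library as sketched below. Several details are assumptions not stated on this page: the form field names (`model`, `file`, `language`, `response_format`) follow the OpenAI-compatible convention, authentication is assumed to be a Bearer token, and the base URL and API key are placeholders. The helper names are hypothetical.

```python
import urllib.request
import uuid


def encode_multipart(fields: dict, filename: str, file_bytes: bytes) -> tuple[bytes, str]:
    """Encode text fields plus one file part; return (body, Content-Type value)."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        text_part = (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
        parts.append(text_part.encode("utf-8"))
    # Assumption: the audio is sent in a form field named "file".
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def build_transcription_request(base_url, api_key, filename, audio_bytes,
                                model="whisper-1", language=None,
                                response_format="json"):
    """Build (but do not send) a POST request for /v1/audio/transcriptions."""
    fields = {"model": model, "response_format": response_format}
    if language is not None:
        fields["language"] = language  # ISO-639-1 code such as "en"
    body, content_type = encode_multipart(fields, filename, audio_bytes)
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        # Assumption: the API key is passed as a Bearer token.
        headers={"Content-Type": content_type, "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
```

The returned request can be sent with `urllib.request.urlopen(request)`. With the default json format, the response body is expected to be a JSON object containing the transcribed text.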
POST /v1/audio/translations
Transcribe and translate an audio file into English, regardless of the source language.
Request body
This endpoint uses multipart/form-data encoding.
The model to use for translation. For example, whisper-1.
The audio file to translate. Same format restrictions as /v1/audio/transcriptions.
Output format: json, text, srt, verbose_json, or vtt. Defaults to json.
Optional text to guide the model’s translation style.
Response
The translated text in English.
Example
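Because the output format is selectable, client code has to read the response differently depending on what it asked for. The sketch below shows one way to do that. It assumes that json and verbose_json responses are JSON objects with the translated text in a `text` field, and that text, srt, and vtt responses are returned as plain text. Neither detail is spelled out on this page, and `extract_translation` is a hypothetical helper.

```python
import json


def extract_translation(body: bytes, response_format: str = "json") -> str:
    """Return the translated text (or subtitle document) from a raw response body."""
    decoded = body.decode("utf-8")
    if response_format in ("json", "verbose_json"):
        # Assumption: the English translation is stored under the "text" key.
        return json.loads(decoded)["text"]
    # Assumption: text, srt, and vtt come back as plain text, returned unchanged.
    return decoded
```

For example, `extract_translation(b'{"text": "Hello"}')` returns `"Hello"`, while an srt body is passed through as-is so it can be written straight to a subtitle file.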
POST /v1/audio/speech
Generate spoken audio from a text string (text-to-speech).
Request body
The TTS model to use. For example, tts-1 or tts-1-hd. tts-1-hd produces higher-quality audio at higher cost.
The text to convert to speech. Maximum length is 4,096 characters.
The voice to use for synthesis. OpenAI TTS supports alloy, echo, fable, onyx, nova, and shimmer. The available voices depend on your configured provider.
The audio output format. One of mp3, opus, aac, or flac. Defaults to mp3.
The playback speed of the generated audio. A value between 0.25 and 4.0. Defaults to 1.0.
Optional text instructions to control speaking style, tone, or pacing.
Response
The response body is raw audio binary data in the format specified by response_format. Set your HTTP client to save the response directly to a file.
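A speech request and the file-saving step might look like the sketch below. It assumes a JSON request body with the OpenAI-compatible field names (`model`, `input`, `voice`, `response_format`, `speed`) and Bearer-token authentication, none of which is stated on this page. The base URL, API key, and helper names are placeholders.

```python
import json
import shutil
import urllib.request

AUDIO_FORMATS = {"mp3", "opus", "aac", "flac"}


def build_speech_request(base_url, api_key, text, voice="alloy", model="tts-1",
                         response_format="mp3", speed=1.0):
    """Validate the documented limits and build (but do not send) the request."""
    if len(text) > 4096:
        raise ValueError("input text exceeds 4,096 characters")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    if response_format not in AUDIO_FORMATS:
        raise ValueError(f"unsupported response_format: {response_format}")
    # Assumption: the body is JSON and uses these field names.
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": response_format,
        "speed": speed,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        # Assumption: the API key is passed as a Bearer token.
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


def save_speech(request, out_path):
    """Send the request and stream the raw audio bytes into out_path."""
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
```

Streaming with `shutil.copyfileobj` writes the audio to disk in chunks, so the whole clip never has to be held in memory. The output file extension should match the requested response_format.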