OpenAI Releases Three Specialized Realtime Voice API Models
What Shipped on May 7
OpenAI released three new realtime voice models into its API on May 7, 2026: gpt-realtime-2, gpt-realtime-translate, and gpt-realtime-whisper. Each handles a distinct voice task: conversation and reasoning, spoken-language translation, and live transcription, respectively.
The release extends the Realtime API that OpenAI introduced in late 2024. Where the original shipped as a single general-purpose voice model, this update splits the three core jobs of the realtime voice API across purpose-built models. The practical implication: you pick the right tool for the workload instead of prompting a generalist into specialized behavior.
Three Models, Three Distinct Jobs
gpt-realtime-2 is the conversational reasoning model. It handles back-and-forth spoken dialogue — customer service agents, tutors, voice-driven assistants that need to hold context and respond intelligently. This is the direct successor to the earlier realtime voice capability, rebuilt with stronger reasoning.
gpt-realtime-translate does spoken-language translation in real time. A user speaks in Spanish; the model outputs English audio. There is no need to chain a transcription model, a translation model, and a TTS model. That three-step pipeline (ASR → MT → synthesis) has been the standard architecture for multilingual voice for years. Collapsing it into one API call should reduce latency, because no stage waits on the previous one, and it avoids the compounding of errors that comes from running three models in sequence.
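To make the compounding concrete, here is a minimal arithmetic sketch. The per-stage accuracy and latency figures are hypothetical placeholders chosen for illustration, not measurements of any real system, and the calculation assumes stage errors are independent and stages run strictly one after another.

```python
from math import prod

# Hypothetical per-stage figures, chosen only for illustration.
# They are not measurements of any real ASR, MT, or TTS system.
STAGES = {
    "asr": {"accuracy": 0.95, "latency_ms": 300},
    "mt": {"accuracy": 0.95, "latency_ms": 200},
    "tts": {"accuracy": 0.98, "latency_ms": 250},
}


def pipeline_accuracy(stages):
    """Probability that every stage is correct, assuming independent errors."""
    return prod(s["accuracy"] for s in stages.values())


def pipeline_latency_ms(stages):
    """Total latency when each stage waits for the previous one to finish."""
    return sum(s["latency_ms"] for s in stages.values())


if __name__ == "__main__":
    print(f"end-to-end accuracy: {pipeline_accuracy(STAGES):.3f}")
    print(f"end-to-end latency: {pipeline_latency_ms(STAGES)} ms")
```

With these made-up inputs, three stages that each look good on their own combine to roughly 88% end-to-end accuracy and 750 ms of latency. A single model still makes mistakes, but its errors and delays are not multiplied or summed across stage boundaries.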
gpt-realtime-whisper is a streaming version of OpenAI’s Whisper transcription model. Unlike the batch Whisper API, which requires a complete audio file before returning results, this model processes audio as it arrives and emits text continuously. The main beneficiary is any interface that needs live captions or immediate text output during speech — not just after the user stops talking.
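The client-side difference between batch and streaming is mostly about when audio leaves the device. A minimal sketch of the streaming side, assuming mono 16-bit PCM at 24 kHz and 20 ms frames (both values are assumptions for illustration, and `pcm16_frames` is a hypothetical helper, not part of any SDK):

```python
from typing import Iterator


def pcm16_frames(
    audio: bytes, sample_rate: int = 24000, frame_ms: int = 20
) -> Iterator[bytes]:
    """Yield fixed-duration frames of mono 16-bit PCM audio.

    The last frame may be shorter than the others.
    """
    bytes_per_sample = 2  # 16-bit mono
    frame_bytes = sample_rate * frame_ms // 1000 * bytes_per_sample
    if frame_bytes <= 0:
        raise ValueError("frame_ms too small for this sample rate")
    for start in range(0, len(audio), frame_bytes):
        yield audio[start:start + frame_bytes]


if __name__ == "__main__":
    one_second_of_silence = bytes(48000)  # 24000 samples * 2 bytes
    frames = list(pcm16_frames(one_second_of_silence))
    print(len(frames), "frames of", len(frames[0]), "bytes")
```

In a batch flow the whole buffer is uploaded once recording ends. In a streaming flow each frame is sent as soon as it is captured, which is what lets the model emit text while the user is still speaking.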
The Realtime Voice API, Rebuilt
All three models sit behind the same WebSocket-based Realtime API. You connect, stream audio in, and receive audio or text back depending on the modalities you configure. Switching between use cases is a model ID swap at session creation:
{
  "model": "gpt-realtime-translate",
  "modalities": ["audio"],
  "input_audio_format": "pcm16",
  "output_audio_format": "pcm16"
}
The specialized model IDs make intent explicit at the API layer instead of encoding it in system prompts. In principle that helps with cost control, since a pure transcription job no longer has to run on a reasoning model, and with throughput, since each model can be optimized for its specific task rather than generalizing across all three.
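The model-ID swap can be sketched as a small lookup on the client. The model IDs are the ones named in this article; the task labels, the `build_session_config` helper, and the choice of text-only output for transcription are assumptions made for illustration, not confirmed API behavior.

```python
import json

# Model IDs as named in this article. The task labels are hypothetical.
MODEL_FOR_TASK = {
    "conversation": "gpt-realtime-2",
    "translation": "gpt-realtime-translate",
    "transcription": "gpt-realtime-whisper",
}


def build_session_config(task: str) -> dict:
    """Return a session config shaped like the JSON fragment above."""
    if task not in MODEL_FOR_TASK:
        raise ValueError(f"unknown task: {task!r}")
    # Assumption: transcription returns text only; the other two return audio.
    modalities = ["text"] if task == "transcription" else ["audio"]
    config = {
        "model": MODEL_FOR_TASK[task],
        "modalities": modalities,
        "input_audio_format": "pcm16",
    }
    if "audio" in modalities:
        config["output_audio_format"] = "pcm16"
    return config


if __name__ == "__main__":
    print(json.dumps(build_session_config("translation"), indent=2))
```

Keeping the mapping in one place means the rest of the app asks for a task rather than a model, so a later change of model ID touches a single table.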
The Translation Model Is the Sleeper
Of the three, gpt-realtime-translate is the one most likely to enable products that weren’t previously buildable. Real-time spoken translation has typically required either a proprietary platform (Google’s Interpreter mode, Apple’s Live Translation) or a hand-built ML pipeline that very few teams could maintain at production scale. A single API endpoint that takes spoken input in language A and returns spoken output in language B at realtime latency opens up:
- Multilingual customer support apps without a bespoke ASR/MT/TTS stack
- Language learning apps where native speakers respond in real time
- Conference and video SDK plugins that add translation as a one-line feature toggle
Until now, infrastructure cost and ML expertise kept real-time spoken translation in the domain of large platforms with proprietary systems, out of reach for a small product team.
What This Means for the Products We Ship
Voice input is increasingly the right interaction model for mobile — especially in markets where typing on a small screen is the primary bottleneck and where users span multiple first languages. The apps we build are seeing more voice feature requests, and the historical constraint has been the complexity of stitching together audio pipelines that stay responsive under real-world network conditions.
The three-model split cuts that complexity considerably. Pricing for the new models had not been published at time of writing. Latency under production load — particularly for gpt-realtime-translate — will determine whether this is a ready-to-ship API or still a demo-grade capability. We are integrating it into a prototype to find out.
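One way to frame that evaluation is to record a round-trip latency per utterance and look at the tail, not the average. A minimal sketch, assuming latency is measured in milliseconds from the end of the user's speech to the first chunk of output (that definition and the `latency_summary` helper are our assumptions for illustration):

```python
from statistics import quantiles


def latency_summary(samples_ms):
    """Return p50, p95, and max for a list of round-trip latencies in ms."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 49 is the median, index 94 is p95.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "max": max(samples_ms)}


if __name__ == "__main__":
    # Placeholder samples for demonstration only, not real measurements.
    demo = [420, 450, 480, 510, 530, 560, 610, 700, 950, 1800]
    print(latency_summary(demo))
```

A conversational or translation feature tends to feel broken at the p95, not the median, so that is the number worth tracking under real network conditions.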
Sources
- Advancing voice intelligence with new models in the API — OpenAI, May 7 2026
- OpenAI launches new voice intelligence features in its API — TechCrunch, May 7 2026
- OpenAI has new voice models that reason, translate, and transcribe as you speak — 9to5Mac, May 7 2026