OpenAI Launches GPT-Realtime-2: Voice Models with GPT-5-Class Reasoning, Live Translation, and Streaming Transcription
Technology | May 7, 2026 | FreeReadText Team

OpenAI introduces three new Realtime API voice models — GPT-Realtime-2 with GPT-5-class reasoning, GPT-Realtime-Translate covering 70+ input languages, and GPT-Realtime-Whisper for live transcription — quadrupling the context window to 128K tokens and bringing voice agents closer to production-ready workflows.

On May 7, 2026, OpenAI launched three new voice intelligence models through its Realtime API: GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. The company framed the release as moving 'real-time audio from simple call-and-response toward voice interfaces that can actually do work,' positioning the lineup squarely at developers building production voice agents rather than experimental demos.

GPT-Realtime-2 is OpenAI's first voice model with GPT-5-class reasoning. Its context window expands from 32,000 to 128,000 tokens, enabling longer sessions and more complex agentic flows without external state management. The model exposes five adjustable reasoning effort levels, from minimal through very high, and supports parallel tool calls as well as preambles: short spoken phrases such as 'let me check that' that an agent can say while it queries back-end systems. It is also described as handling domain vocabulary such as proper nouns and medical terminology, and EU Data Residency is available for enterprise deployments. Pricing is set at $32 per million audio input tokens, $0.40 per million cached input tokens, and $64 per million audio output tokens.
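To make the per-token rates concrete, here is a rough cost sketch in Python. Only the three rates come from the figures quoted above. The function name `session_cost` and the token counts in the example are hypothetical, and the sketch assumes billing is a simple linear sum of the three token categories, which may not match how an actual invoice is computed.

```python
from decimal import Decimal

# Rates quoted in the article, in USD per one million tokens.
AUDIO_INPUT_PER_M = Decimal("32")
CACHED_INPUT_PER_M = Decimal("0.40")
AUDIO_OUTPUT_PER_M = Decimal("64")
MILLION = Decimal(1_000_000)


def session_cost(audio_in: int, cached_in: int, audio_out: int) -> Decimal:
    """Estimate the USD cost of one session from its token counts.

    Assumes cost is a plain linear sum over the three token categories.
    """
    total = (
        audio_in * AUDIO_INPUT_PER_M
        + cached_in * CACHED_INPUT_PER_M
        + audio_out * AUDIO_OUTPUT_PER_M
    )
    return total / MILLION


# Hypothetical session: 10,000 fresh audio input tokens,
# 50,000 cached input tokens, 5,000 audio output tokens.
estimate = session_cost(10_000, 50_000, 5_000)
print(f"Estimated session cost: ${estimate:.2f}")
```

With these made-up counts, the cached input contributes far less than either audio category despite being the largest in volume, which illustrates why cached context matters for long sessions at a 128,000-token window.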

GPT-Realtime-Translate provides live speech translation across more than 70 input languages and 13 output languages, keeping pace with the speaker for real-time multilingual conversations. Deutsche Telekom is exploring it for customer support and Vimeo for educational video localization. GPT-Realtime-Whisper is a streaming speech-to-text model that transcribes audio as it is spoken, targeting use cases like live captions and meeting notes. Both are billed per minute rather than per token: Translate at $0.034 per minute and Whisper at $0.017 per minute, pricing that favors short, bursty workloads.
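The per-minute rates lend themselves to a quick back-of-the-envelope check. The sketch below uses only the two rates quoted above. The helper `per_minute_cost` and the 60-minute meeting are hypothetical, and it assumes audio is billed linearly by duration with no rounding up to whole minutes, which the article does not specify.

```python
from decimal import Decimal

# Rates quoted in the article, in USD per minute of audio.
TRANSLATE_PER_MIN = Decimal("0.034")
WHISPER_PER_MIN = Decimal("0.017")


def per_minute_cost(minutes, rate: Decimal) -> Decimal:
    """Estimate USD cost for a given audio duration in minutes.

    Assumes linear billing by duration (no minimums, no rounding).
    """
    # Going through str() avoids binary floating-point noise for inputs like 2.5.
    return Decimal(str(minutes)) * rate


# Hypothetical 60-minute meeting.
meeting_minutes = 60
transcription = per_minute_cost(meeting_minutes, WHISPER_PER_MIN)
translation = per_minute_cost(meeting_minutes, TRANSLATE_PER_MIN)
print(f"Transcription: ${transcription:.2f}")
print(f"Translation:   ${translation:.2f}")
```

At the quoted rates, translation costs exactly twice as much as transcription for the same duration.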

The launch lands in an increasingly crowded field. Google's Gemini 3.1 Flash TTS reached the Artificial Analysis leaderboard in April, Meta open-sourced Llama-Voice the same month, and Mistral published its Voxtral TTS weights in late March. OpenAI's differentiator is integration: combining reasoning, translation, and transcription in one API removes the need to stitch together a separate LLM, TTS, and STT stack. The company emphasized that Realtime API conversations now ship with active safety classifiers that can halt sessions violating OpenAI's content guidelines, with developers able to layer additional protections through the Agents SDK.

Tags: OpenAI, GPT-Realtime-2, Realtime API, Live Translation, Speech-to-Text, Voice Agents
