05/07/2026 | Press release | Distributed by Public on 05/07/2026 11:29
A new generation of realtime voice models that can reason, translate, and transcribe as people speak.
We're introducing three audio models in the API that unlock a new class of voice apps for developers. With these models, developers can build voice experiences that feel more natural, respond more intelligently, and take action in real time:
Voice is becoming one of the most natural ways for people to use software. It lets someone ask for help while driving, change a travel plan while walking through an airport, get support in their preferred language, or move through a task without stopping to type.
But building useful voice products takes more than fast turn-taking or a natural-sounding voice. A voice agent needs to understand what someone means, keep track of context, recover when a request changes, use tools while the conversation continues, and respond in a way that feels appropriate to the moment.
Together, the models we are launching move realtime audio from simple call-and-response toward voice interfaces that can actually do work: listen, reason, translate, transcribe, and take action as a conversation unfolds.
As voice becomes a more natural way to use software, we're seeing developers build around three emerging patterns in voice AI:
These patterns can also work together. Priceline is working toward a future where travelers can manage entire trips by voice: searching for flights and hotels conversationally, handling changes like adjusting a hotel reservation after a flight delay or getting real-time updates on TSA wait times, and translating conversations once travelers are on the ground.
GPT-Realtime-2 is built for live voice interactions where the model keeps the conversation moving while it reasons through a request, calls tools, handles corrections or interruptions, and responds in a way that fits the moment.
The gains show up on audio evals that map closely to production voice agents: GPT-Realtime-2 (high) scores 15.2% higher than GPT-Realtime-1.5 on Big Bench Audio, a benchmark for audio intelligence. GPT-Realtime-2 (xhigh) scores 13.8% higher on Audio MultiChallenge, a benchmark for instruction following, reflecting stronger reasoning, context management, and control in live conversations.
Big Bench Audio evaluates challenging reasoning capabilities in language models that support audio input. Audio MultiChallenge evaluates multi-turn conversational intelligence in spoken dialogue systems, including instruction following, context integration, self-consistency, and handling natural speech corrections.
The magic of GPT-Realtime-2 shows up across a variety of use cases:
During early testing, businesses used GPT-Realtime-2 to build voice agents that help customers and employees get things done through natural conversation:
GPT-Realtime-Translate helps developers build live multilingual voice experiences in which each person can speak their preferred language, hear the conversation translated in real time, and follow along with live transcriptions. It supports more than 70 input languages and 13 output languages, making it useful for customer support, cross-border sales, education, events, media, and creator platforms serving global audiences.
For developers, live translation needs to preserve meaning while keeping pace with the speaker, even when people speak naturally, switch context, or use regional pronunciation and domain-specific language. For example, Deutsche Telekom is testing the model for multilingual voice interactions, where lower latency and stronger fluency can make cross-language conversations feel more natural.
In this video, Vimeo shows how GPT-Realtime-Translate can translate a product education video live as it plays, so global customers can hear updates in their preferred language without waiting for a separately produced version.
GPT-Realtime-Whisper is a new streaming transcription model built for low-latency speech-to-text. It transcribes audio as people speak, so live products can feel faster, more responsive, and more natural: from captions that appear in the moment to meeting notes that keep up with the conversation.
The model makes live speech usable inside business workflows as it happens. Teams can power captions for meetings, classrooms, broadcasts, and events; generate notes and summaries while conversations are still in progress; build voice agents that need to understand users continuously; and create faster follow-up workflows for customer support, healthcare, sales, recruiting, and other high-volume spoken interactions.
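For products that consume the stream, a caption or notes view typically folds incremental transcript events into per-utterance text. Below is a minimal sketch of that accumulation step; the event names are modeled on the Realtime API's input-audio transcription events and should be treated as assumptions to verify against the GPT-Realtime-Whisper docs:

```javascript
// Sketch: fold streaming transcription events into per-item transcripts.
// Event type names are assumptions modeled on the Realtime API's
// transcription events; check them against the current docs.
function applyTranscriptionEvent(transcripts, event) {
  const next = { ...transcripts };
  switch (event.type) {
    case "conversation.item.input_audio_transcription.delta":
      // Append the incremental text for this audio item.
      next[event.item_id] = (next[event.item_id] ?? "") + event.delta;
      break;
    case "conversation.item.input_audio_transcription.completed":
      // The completed event carries the final transcript for the item.
      next[event.item_id] = event.transcript;
      break;
    default:
      break; // Ignore unrelated events (audio buffers, session updates, ...).
  }
  return next;
}

// Example: a caption updates as deltas stream in.
let state = {};
state = applyTranscriptionEvent(state, {
  type: "conversation.item.input_audio_transcription.delta",
  item_id: "item_1",
  delta: "Hello, wor",
});
state = applyTranscriptionEvent(state, {
  type: "conversation.item.input_audio_transcription.delta",
  item_id: "item_1",
  delta: "ld.",
});
console.log(state.item_1); // "Hello, world."
```

Keeping the reducer pure (it returns a new object rather than mutating) makes it easy to drive a UI framework's state from the event stream.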
The Realtime API incorporates multiple layers of safeguards and mitigations to help prevent misuse. We run active classifiers over Realtime API sessions, so conversations detected as violating our harmful content guidelines can be halted. Developers can also add their own safety guardrails using the Agents SDK.
Our usage policies prohibit repurposing or distributing outputs from our services for spam, deception, or other harmful purposes. Developers must also make it clear to end users when they're interacting with AI, unless it's already obvious from the context.
The Realtime API fully supports EU Data Residency for EU-based applications and is covered by our enterprise privacy commitments.
GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper are available in the Realtime API. GPT-Realtime-2 is priced at $32 / 1M audio input tokens ($0.40 for cached input tokens) and $64 / 1M audio output tokens. GPT-Realtime-Translate is priced at $0.034 per minute, and GPT-Realtime-Whisper at $0.017 per minute.
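To get a feel for these rates, here is a quick cost sketch using the listed prices. One assumption: the cached-input rate is read as per 1M tokens, matching the other token prices:

```javascript
// Back-of-the-envelope cost estimates from the listed prices.
// Assumption: the "$0.40 for cached input tokens" rate is per 1M tokens,
// consistent with the other per-1M token prices.
const PRICES = {
  realtime2: { audioInPerM: 32, cachedInPerM: 0.4, audioOutPerM: 64 },
  translatePerMin: 0.034,
  whisperPerMin: 0.017,
};

function realtime2Cost({ inputTokens, cachedTokens = 0, outputTokens }) {
  const p = PRICES.realtime2;
  return (
    (inputTokens / 1e6) * p.audioInPerM +
    (cachedTokens / 1e6) * p.cachedInPerM +
    (outputTokens / 1e6) * p.audioOutPerM
  );
}

// Example: a session with 500k audio input tokens and 250k output tokens.
console.log(realtime2Cost({ inputTokens: 500_000, outputTokens: 250_000 })); // 32

// A 90-minute event translated live, and a 60-minute meeting transcribed.
console.log((90 * PRICES.translatePerMin).toFixed(2)); // "3.06"
console.log((60 * PRICES.whisperPerMin).toFixed(2)); // "1.02"
```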
You can test the new realtime voice models in the Playground. If you have Codex installed, run the prompt below to add GPT-Realtime-2 to your existing app or create a new app with it.
Build or add a minimal Realtime 2 WebRTC voice agent using the gpt-realtime-2 model.
Use the latest OpenAI Realtime API docs for the WebRTC and session setup patterns. If this folder already contains an app, add it to the existing app. Otherwise, create a small local web app. Add a server-side session endpoint that uses OPENAI_API_KEY and posts browser SDP to /v1/realtime/calls following the docs exactly: multipart FormData fields named sdp and session, not file uploads. Connect browser microphone input and model audio output with RTCPeerConnection, open an oai-events data channel, and register one sample function tool with session.update: check_calendar(date, time), which returns whether the requested time is available.
Keep it small and include setup/run instructions.
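For reference, the server-side piece that prompt describes can be sketched as follows. The endpoint path, the multipart field names `sdp` and `session`, and the `check_calendar` tool come from the prompt above; the exact session JSON shape is an assumption to verify against the current Realtime API docs. Assumes Node 18+, which provides global `fetch` and `FormData`:

```javascript
// Sketch of the server-side session endpoint: take the browser's SDP
// offer, attach a session config (model + the sample check_calendar
// tool), and forward both to /v1/realtime/calls as multipart FormData
// fields named "sdp" and "session" (plain text fields, not file uploads).
// The session object's shape is an assumption; check the Realtime docs.
function buildCallForm(sdpOffer) {
  const session = {
    type: "realtime",
    model: "gpt-realtime-2",
    tools: [
      {
        type: "function",
        name: "check_calendar",
        description: "Check whether a requested time slot is available.",
        parameters: {
          type: "object",
          properties: {
            date: { type: "string" },
            time: { type: "string" },
          },
          required: ["date", "time"],
        },
      },
    ],
  };
  const form = new FormData();
  form.set("sdp", sdpOffer); // text field, per the prompt's instructions
  form.set("session", JSON.stringify(session));
  return form;
}

async function createCall(sdpOffer) {
  const res = await fetch("https://api.openai.com/v1/realtime/calls", {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: buildCallForm(sdpOffer),
  });
  return res.text(); // SDP answer to hand back to the RTCPeerConnection
}
```

Keeping the API key on the server and shipping only the SDP answer to the browser is the point of this endpoint; the browser side then sets the answer as the `RTCPeerConnection` remote description and listens on the `oai-events` data channel.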