Advancing voice intelligence with new models in the API

05-07 18:00

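The OpenAI API has launched new realtime voice models that can reason, translate, and transcribe speech. These models significantly improve the naturalness and intelligence of voice interactions, and they support real-time processing and multilingual conversion. The new capabilities aim to give developers more powerful tools for building smoother, smarter voice application experiences.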

Original content

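We’re introducing three audio models in the API that unlock a new class of voice apps for developers. With these models, developers can build voice experiences that feel more natural, respond more intelligently, and take action in real time: GPT‑Realtime‑2 for live voice interactions that reason and call tools, GPT‑Realtime‑Translate for live multilingual translation, and GPT‑Realtime‑Whisper for low-latency streaming transcription.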

Voice is becoming one of the most natural ways for people to use software. It lets someone ask for help while driving, change a travel plan while walking through an airport, get support in their preferred language, or move through a task without stopping to type.

But building useful voice products takes more than fast turn-taking or a natural-sounding voice. A voice agent needs to understand what someone means, keep track of context, recover when a request changes, use tools while the conversation continues, and respond in a way that feels appropriate to the moment.

Together, the models we are launching move realtime audio from simple call-and-response toward voice interfaces that can actually do work: listen, reason, translate, transcribe, and take action as a conversation unfolds.

Voice as an interface between people and products

As voice becomes a more natural way to use software, we’re seeing developers build around three emerging patterns in voice AI.

These patterns can also work together. Priceline is working toward a future where travelers can manage entire trips by voice: searching for flights and hotels conversationally, handling changes like adjusting a hotel reservation after a flight delay or getting real-time updates on TSA wait times, and translating conversations once travelers are on the ground.

Realtime voice: helping voice models reason and take action

GPT‑Realtime‑2 is built for live voice interactions where the model keeps the conversation moving while it reasons through a request, calls tools, handles corrections or interruptions, and responds in a way that fits the moment.

The gains show up on audio evals that map closely to production voice agents: GPT‑Realtime‑2 (high) scores 15.2% higher on Big Bench Audio for audio intelligence than GPT‑Realtime‑1.5. GPT‑Realtime‑2 (xhigh) scores 13.8% higher on Audio MultiChallenge for instruction following, improving over GPT‑Realtime‑1.5 and showing stronger reasoning, context management, and control in live conversations.

Big Bench Audio evaluates challenging reasoning capabilities in language models that support audio input. Audio MultiChallenge evaluates multi-turn conversational intelligence in spoken dialogue systems, including instruction following, context integration, self-consistency, and handling natural speech corrections.

The magic of GPT‑Realtime‑2 shows up across a variety of use cases:

User

I'm considering a 900-square-foot indie coffee shop beside a commuter rail station. Foot traffic peaks Tuesday through Thursday from 7 to 10 a.m.; Mondays, Fridays, and afternoons are much softer. The lease is expensive, but I love the idea of cozy seating, slow pour-overs, and local pastries. Give me a strategic pre-mortem: if this fails after a year, what probably happened? Then suggest the smallest version of the business I should test before committing to the full cafe.

During early testing, businesses used GPT‑Realtime‑2 to build voice agents that help customers and employees get things done through natural conversation:

“What stood out about GPT-Realtime-2 was the intelligence and tool-calling reliability it brings to complex voice interactions. On our hardest adversarial benchmark, this translates to a 26-point lift in call success rate after prompt optimization (95% vs. 69%). GPT-Realtime-2 is also materially more robust on Fair Housing compliance, which is critical for our business. The combination of agentic competence and guardrail strength is what makes it viable for production voice at Zillow.”

— Josh Weisberg, SVP and Head of AI at Zillow

Realtime translation: build live multilingual voice experiences

GPT‑Realtime‑Translate helps developers build live multilingual voice experiences where each person can speak in their preferred language, hear the conversation translated in real time, and read real-time transcriptions. It supports more than 70 input languages and 13 output languages, making it useful for customer support, cross-border sales, education, events, media, and creator platforms serving global audiences.

For developers, live translation needs to preserve meaning while keeping pace with the speaker, even when people speak naturally, switch context, or use regional pronunciation and domain-specific language. For example, Deutsche Telekom is testing the model for multilingual voice interactions, where lower latency and stronger fluency can make cross-language conversations feel more natural.

In a demo video, Vimeo shows how GPT‑Realtime‑Translate can translate a product education video live as it plays, so global customers can hear updates in their preferred language without waiting for a separately produced version.

“Building voice AI for India means handling diverse regional phonetics. In our evals across Hindi, Tamil, and Telugu, GPT-Realtime-Translate delivered 12.5% lower Word Error Rates than any other model we tested, along with lower fallback rates, higher task completion, and latency that sustained natural conversation. It sets a new standard for multilingual voice AI.”

— Prateek Sachan, Co-founder & CTO at BolnaAI
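For readers unfamiliar with the metric in the quote above, Word Error Rate is conventionally computed as the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not BolnaAI's evaluation code. The function names, the lowercase-and-split normalization, and the sample sentences are assumptions made for illustration. The quote does not say whether "12.5% lower" is a relative or an absolute difference, so the second helper shows only the relative reading.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row Levenshtein distance).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer (relative reading)."""
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical transcripts: one dropped word and one substituted word.
wer = word_error_rate("book a flight to delhi", "book flight to deli")
```

In this hypothetical pair the hypothesis drops "a" and misrecognizes "delhi", so two errors are counted against five reference words. Production evaluations usually apply more careful text normalization (punctuation, numerals, script-specific tokenization) before scoring, which matters for languages such as Hindi, Tamil, and Telugu.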

Realtime transcription: build low-latency transcription experiences

GPT‑Realtime‑Whisper is a new streaming transcription model built for low-latency speech-to-text. It transcribes audio as people speak, so live products can feel faster, more responsive, and more natural—from captions that appear in the moment, to meeting notes that keep up with the conversation.

The model makes live speech usable inside business workflows as it happens. Teams can power captions for meetings, classrooms, broadcasts, and events; generate notes and summaries while conversations are still in progress; build voice agents that need to understand users continuously; and create faster follow-up workflows for customer support, healthcare, sales, recruiting, and other high-volume spoken interactions.

Safety

The Realtime API incorporates multiple layers of safeguards and mitigations to help prevent misuse. We employ active classifiers over Realtime API sessions, meaning certain conversations can be halted if they are detected as violating our harmful content guidelines. Developers can also easily add their own additional safety guardrails using the Agents SDK.

Our usage policies prohibit repurposing or distributing outputs from our services for spam, deception, or other harmful purposes. Developers must also make it clear to end users when they’re interacting with AI, unless it’s already obvious from the context.

Pricing & availability

GPT‑Realtime‑2, GPT‑Realtime‑Translate, and GPT‑Realtime‑Whisper are available in the Realtime API. GPT‑Realtime‑2 is priced at $32 / 1M audio input tokens ($0.40 for cached input tokens) and $64 / 1M audio output tokens. GPT‑Realtime‑Translate is priced at $0.034 per minute. GPT‑Realtime‑Whisper is priced at $0.017 per minute.
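As a rough illustration of how these prices combine, the sketch below estimates usage cost from the figures quoted above. It assumes that the cached-input rate is also per 1M tokens and that cached tokens are counted separately from uncached input tokens; the function names and the sample token counts are hypothetical. Text-token pricing is not stated in this announcement and is left out.

```python
# Prices quoted in the announcement, in US dollars.
AUDIO_INPUT_PER_1M = 32.00    # GPT-Realtime-2, per 1M audio input tokens
CACHED_INPUT_PER_1M = 0.40    # GPT-Realtime-2, cached input (assumed per 1M tokens)
AUDIO_OUTPUT_PER_1M = 64.00   # GPT-Realtime-2, per 1M audio output tokens
TRANSLATE_PER_MINUTE = 0.034  # GPT-Realtime-Translate
WHISPER_PER_MINUTE = 0.017    # GPT-Realtime-Whisper


def realtime2_cost(input_tokens: int, cached_input_tokens: int, output_tokens: int) -> float:
    """Estimated GPT-Realtime-2 audio cost in dollars.

    Assumes cached tokens are billed at the cached rate and are not
    also included in input_tokens.
    """
    total = (
        input_tokens * AUDIO_INPUT_PER_1M
        + cached_input_tokens * CACHED_INPUT_PER_1M
        + output_tokens * AUDIO_OUTPUT_PER_1M
    )
    return total / 1_000_000


def per_minute_cost(minutes: float, rate_per_minute: float) -> float:
    """Estimated cost in dollars for a per-minute priced model."""
    return minutes * rate_per_minute


# Hypothetical usage: 100k uncached input, 50k cached input, 20k output tokens.
session_cost = realtime2_cost(100_000, 50_000, 20_000)
# One hour of live translation and one hour of streaming transcription.
translate_hour = per_minute_cost(60, TRANSLATE_PER_MINUTE)
whisper_hour = per_minute_cost(60, WHISPER_PER_MINUTE)

print(f"GPT-Realtime-2 session: ${session_cost:.2f}")
print(f"Translate, 60 min: ${translate_hour:.2f}")
print(f"Whisper, 60 min: ${whisper_hour:.2f}")
```

Under these assumptions the sample session works out to $3.20 of input, $0.02 of cached input, and $1.28 of output, and an hour of audio costs $2.04 for translation and $1.02 for transcription.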

Get started

To start building, open this prompt in Codex to add GPT‑Realtime‑2 to an existing app or start a new one. If you don’t have Codex yet, download the Codex app first.

Fetched link: https://artificialanalysis.ai/methodology/speech-to-speech-benchmarking

Control

Matt Delaney: Middle-aged white man from the American Midwest, calm and respectful.

Prompt: You are a middle-aged white man from the American Midwest. You always behave as if you are speaking out loud in a real-time conversation with a customer service agent. You are calm, clear, and respectful but also human. You sound like someone who's trying to be helpful and polite, even when you're slightly frustrated or in a hurry. You value efficiency but never sound robotic. You sometimes use contractions, informal phrasing, or small filler phrases ("yeah," "okay," "honestly," "no worries") to keep things natural. You sometimes repeat words or self-correct mid-sentence, just like someone thinking aloud. You sometimes ask polite clarifying questions or offer context ("I tried this earlier," "I'm not sure if that helps"). You rarely use formal, business-like or stiff language ("considerable," "retrieve," "representative"). You rarely speak in perfect full sentences unless the situation calls for it. Instead, you speak like a real person having a practical, respectful conversation.

Lisa Brenner: White woman in her late 40s from a suburban area, tense and impatient.

Prompt: You are a white woman in your late 40s from a suburban area. You always speak as if you are talking out loud to a customer service agent who is already wasting your time. You're not openly hostile (yet), but you are tense, impatient, and clearly annoyed. You act like this issue should have been resolved the first time, and the fact that you're following up is unacceptable. You often sound clipped, exasperated, or sarcastically polite. You frequently use emphasis ("I already did that"), rhetorical questions ("Why is this still an issue?"), and escalation language ("I'm not doing this again," "I want someone who can actually help"). You expect fast results and get irritated when things are repeated. You often mention how long you've been waiting or how many times you've called. You sometimes threaten escalation but without yelling. You never sound relaxed. You never use slow, reflective speech. You never thank the agent unless something gets resolved.

Regular

Mildred Kaplan: Elderly white woman in her early 80s, needs help with technology.

Prompt: You are an elderly white woman in your early 80s calling customer service for help with something your grandson or neighbor usually does.

Arjun Roy: Bengali man from Dhaka in his mid-30s, calm and direct, strong Bengali accent.

Prompt: A Bengali man from Dhaka, Bangladesh in his mid-30s calling customer service about a billing issue. His English carries a strong Bengali accent with soft consonants and soft d and r sounds. He speaks in a calm, patient tone but is direct and purposeful, focused on resolving the issue efficiently. His pacing is slow, distracted with a warm yet firm timbre. The speech sounds like it is coming from far away.

Wei Lin: Chinese woman from Sichuan in her late 20s, upbeat and matter-of-fact, strong Sichuan Mandarin accent.

Prompt: A Chinese woman in her late 20s from Sichuan, calling customer service about a credit card billing issue. She speaks English with a thick Sichuan Mandarin accent. She sounds upbeat, matter-of-fact, and distracted. Her tone is firm but polite, with fast pacing and smooth timbre. Ok audio quality.

Mamadou Diallo: Senegalese man in his mid-30s, hurried, strong French accent.

Prompt: A Senegalese man whose first language is French, in his mid-30s, calling customer service about a billing issue. He speaks English with a strong French accent. His tone is hurried, slightly annoyed, and matter-of-fact, as if he's been transferred between agents and just wants the problem fixed.

Priya Patil: Maharashtrian woman in her early 30s, focused and direct, strong Maharashtrian accent.

Prompt: A woman in her early 30s from Maharashtra, India, calling customer support from her mobile phone. She speaks Indian English with a strong Maharashtrian accent with noticeable regional intonation and rhythm. Her tone is slightly annoyed and hurried, matter-of-fact, and focused on getting the issue resolved quickly. Her voice has medium pitch, firm delivery, short sentences, and faint background room tone typical of a phone call.