OpenAI unveils new audio models to redefine voice AI with real-time speech capabilities

OpenAI has unveiled a new suite of audio models to power voice agents, and it is now available to developers around the world. The latest updates mark a major step in voice AI technology. The AI powerhouse has introduced new tools and models that enable developers to create voice agents, or AI-driven systems capable of real-time speech interactions.
Even though voice is a natural human interface, it remains largely underutilised in today's AI applications. With the slew of updates, OpenAI is aiming to change this, essentially enabling businesses and developers to create more sophisticated voice agents. These systems can function on their own, assisting users through spoken interactions across use cases ranging from customer care to language learning.
What’s new?
OpenAI has introduced three main advancements in audio AI. These are two state-of-the-art speech-to-text models, a new text-to-speech model, and some enhancements to the Agents SDK. The new speech-to-text models have outperformed OpenAI’s previous Whisper models in almost all tested languages, with significant improvements in transcription accuracy and efficiency.
On the other hand, the new text-to-speech model enables precise control over not just the spoken words but how they are said, enhancing the overall expressiveness of AI-generated speech. With the Agents SDK, the latest update makes it easier to convert text-based agents into voice-based AI assistants that offer seamless interactions.
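As a rough illustration, a call to this kind of controllable text-to-speech could look like the minimal sketch below, which assumes the `openai` Python package, the `gpt-4o-mini-tts` model identifier (the article does not name the model), and an `instructions` parameter that steers delivery:

```python
# Minimal sketch of expressive text-to-speech. The model name
# "gpt-4o-mini-tts" and the `instructions` parameter are assumptions;
# `instructions` steers HOW the words are spoken, not what is said.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",  # assumed model identifier
    voice="alloy",
    input="Thanks for calling. How can I help you today?",
    instructions="Speak in a warm, unhurried customer-support tone.",
) as response:
    response.stream_to_file("greeting.mp3")  # placeholder output path
```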
What do voice agents do?
Voice agents function similarly to text-based AI assistants. However, they operate through speech instead of text interactions. Some use cases include customer support, where AI answers calls and handles queries; language learning, where an AI-powered coach can help users with pronunciation and practise conversations; and accessibility tools, which offer voice-controlled assistants for users with disabilities.
How to build voice AI?
When it comes to building voice AI, there are essentially two approaches – speech-to-speech (S2S) and speech-to-text-to-speech (S2T2S). S2S models take spoken input and produce spoken output without intermediate transcription. Reportedly, this approach maintains nuances like intonation, emotion, and emphasis. Meanwhile, S2T2S models first transcribe speech to text, process it, and convert the result back into speech. Although these are easier to implement, they often lose key details and may add latency. OpenAI's latest updates emphasise the advantages of speech-to-speech processing, making AI interactions more natural and fluid.
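To make the S2T2S chain concrete, here is a minimal sketch of the three hops (transcribe, reason, speak) using the `openai` Python package; the model identifiers and file names are illustrative assumptions, not the article's own example:

```python
# Minimal S2T2S sketch: speech -> text -> LLM -> speech.
# Model names and file names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()

# 1. Speech-to-text: transcribe the user's spoken input.
with open("user_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-mini-transcribe",
        file=audio_file,
    )

# 2. Text processing: generate a reply with a chat model.
reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. Text-to-speech: voice the reply. Each hop adds latency, which is
# the trade-off the article notes versus direct speech-to-speech.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="alloy",
    input=answer,
) as response:
    response.stream_to_file("agent_reply.mp3")
```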

GPT-4o Transcribe and GPT-4o Mini Transcribe
OpenAI has also introduced two new transcription models – GPT-4o Transcribe and GPT-4o Mini Transcribe. GPT-4o Transcribe is a large speech model trained on vast amounts of audio data, delivering highly accurate transcriptions, while GPT-4o Mini Transcribe is a smaller, more efficient model designed for faster and more cost-efficient transcription. OpenAI has claimed that both models deliver industry-leading word error rates, significantly improving upon previous Whisper versions. When it comes to pricing, GPT-4o Transcribe is offered at $0.006 per minute, the same as Whisper, while GPT-4o Mini Transcribe costs $0.003 per minute.
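For reference, a transcription request against either model could look like the sketch below (the file name is a placeholder). At the quoted $0.006 per minute, a ten-minute call would cost roughly six cents to transcribe with the larger model:

```python
# Sketch: transcribing a recorded call with GPT-4o Transcribe.
# Swap in "gpt-4o-mini-transcribe" for the cheaper, faster variant.
from openai import OpenAI

client = OpenAI()

with open("support_call.mp3", "rb") as audio_file:  # placeholder file
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=audio_file,
    )

print(transcript.text)
```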
The latest updates from OpenAI suggest that voice will be a key focus area for AI development. Given their affordability, these models are likely to push businesses and developers to build high-quality voice agents.
© IE Online Media Services Pvt Ltd
