Major ChatGPT-4o update allows audio-video talks with an “emotional” AI chatbot

May 13, 2024


On Monday, OpenAI debuted GPT-4o (o for “omni”), a major new AI model that can ostensibly converse using speech in real time, reading emotional cues and responding to visual input. It operates faster than OpenAI’s previous best model, GPT-4 Turbo, and will be free for ChatGPT users and available as a service through the API, rolling out over the next few weeks, OpenAI says.

OpenAI revealed the new audio conversation and vision comprehension capabilities in a YouTube livestream titled “OpenAI Spring Update,” presented by OpenAI CTO Mira Murati and employees Mark Chen and Barret Zoph, which included live demos of GPT-4o in action.

OpenAI claims that GPT-4o responds to audio inputs in about 320 milliseconds on average, which is similar to human response times in conversation, according to a 2009 study, and much shorter than the typical 2–3 second lag experienced with previous models. With GPT-4o, OpenAI says it trained a brand-new AI model end-to-end using text, vision, and audio in a way that all inputs and outputs “are processed by the same neural network.”


“Because GPT-4o is our first model combining all of these modalities, we are still just scratching the surface of exploring what the model can do and its limitations,” OpenAI says.

During the livestream, OpenAI demonstrated GPT-4o’s real-time audio conversation capabilities, showcasing its ability to engage in natural, responsive dialogue. The AI assistant seemed to easily pick up on emotions, adapted its tone and style to match the user’s requests, and even incorporated sound effects, laughing, and singing into its responses.

OpenAI CTO Mira Murati seen debuting GPT-4o during OpenAI’s Spring Update livestream on May 13, 2024.

OpenAI

The presenters also highlighted GPT-4o’s enhanced visual comprehension. By uploading screenshots, documents containing text and images, or charts, users can apparently hold conversations about the visual content and receive data analysis from GPT-4o. In the live demo, the AI assistant demonstrated its ability to analyze selfies, detect emotions, and engage in lighthearted banter about the images.
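
Although the demo used the ChatGPT interface, developers will presumably be able to pose the same kind of image-grounded questions through the API once GPT-4o is available there. Below is a rough sketch of our own (not OpenAI’s demo code), reusing the Chat Completions image_url content format that GPT-4 with vision already accepts; the image URL and prompt are hypothetical placeholders.

```python
# Hypothetical sketch: asking GPT-4o about an image via the Chat Completions API.
# The image URL and prompt are placeholders; the request format mirrors the
# existing GPT-4-with-vision API, not anything shown in the livestream.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What trend does this chart show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sales_chart.png"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```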

Additionally, GPT-4o exhibited improved speed and quality in more than 50 languages, which OpenAI says covers 97 percent of the world’s population. The model also showcased its real-time translation capabilities, facilitating conversations between speakers of different languages with near-instantaneous translations.

OpenAI first added conversational voice features to ChatGPT in September 2023, using Whisper, its AI speech recognition model, for input and a custom voice synthesis technology for output. In the past, OpenAI’s multimodal ChatGPT interface used three separate processes: transcription (speech to text), intelligence (processing the text as tokens), and text-to-speech, with each step adding latency. With GPT-4o, all of those steps reportedly happen at once. It “reasons across voice, text, and vision,” according to Murati. OpenAI called this an “omnimodel” in a slide shown on-screen behind Murati during the livestream.
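
To make the latency difference concrete, here is a rough reconstruction of that older three-step pipeline assembled from OpenAI’s public API: Whisper handles transcription, a text-only chat model produces the reply, and a separate text-to-speech call voices it. This is our own approximation (file names and model choices are illustrative), not OpenAI’s internal implementation, but it shows why each hop adds its own round trip.

```python
# Approximate sketch of the pre-GPT-4o voice pipeline using public OpenAI
# endpoints. Each step is a separate network call, so latency stacks up.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Step 1: transcription (speech -> text) with Whisper
with open("user_question.wav", "rb") as audio_file:  # placeholder recording
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: "intelligence" (text in, text out) with a text-only chat model
completion = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = completion.choices[0].message.content

# Step 3: text-to-speech to voice the reply
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
speech.stream_to_file("assistant_reply.mp3")
```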

OpenAI announced that GPT-4o will be accessible to all ChatGPT users, with paid subscribers getting five times the rate limits of free users. The API version of GPT-4o will also reportedly offer twice the speed, 50 percent lower cost, and five times higher rate limits compared to GPT-4 Turbo.

In Her, the main character talks to an AI personality through wireless earbuds similar to AirPods.

Warner Bros.

The capabilities demonstrated during the livestream, and in numerous videos on OpenAI’s website, recall the conversational AI agent in the 2013 sci-fi film Her. In that film, the lead character develops a personal attachment to the AI personality. With GPT-4o’s simulated emotional expressiveness (artificial emotional intelligence, you could call it), it’s not inconceivable that users may form similar emotional attachments to OpenAI’s assistant, as we’ve already seen in the past.

Murati acknowledged the new challenges posed by GPT-4o’s real-time audio and image capabilities in terms of safety, and stated that the company will continue researching safety and soliciting feedback from test users during its iterative deployment over the coming weeks.

“GPT-4o has also undergone extensive external red teaming with 70+ external experts in domains such as social psychology, bias and fairness, and misinformation to identify risks that are introduced or amplified by the newly added modalities,” says OpenAI. “We used these learnings [sic] to build out our safety interventions in order to improve the safety of interacting with GPT-4o. We will continue to mitigate new risks as they’re discovered.”

Updates to ChatGPT

Also on Monday, OpenAI announced several updates to ChatGPT, including a ChatGPT desktop app for macOS, which is available to ChatGPT Plus users starting today and will become “more broadly available” in the coming weeks, according to OpenAI. OpenAI is also streamlining the ChatGPT interface with a new home screen and message layout.

And as we mentioned briefly above, when using the GPT-4o model (once it becomes widely available), ChatGPT Free users will have access to web browsing, data analytics, the GPT Store, and Memory features, which were previously limited to ChatGPT Plus, Team, and Enterprise subscribers.
