What can OpenAI’s new GPT-4o AI model do? | Explained

Days after denying rumours of a new AI search engine and a GPT-5 release, OpenAI livestreamed the launch of its new flagship AI model, GPT-4o, capable of accepting audio and visual inputs and generating output almost flawlessly. The ‘o’ in GPT-4o stands for “omni,” meaning it can receive multimodal inputs through text, audio, and images, unlike the early days of ChatGPT, when users had to submit text to receive a text response.

OpenAI claims GPT-4o can respond to audio input in as little as 232 milliseconds, with an average response time of 320 milliseconds. To cover for this latency, the AI interface uses the usual conversational fillers or sometimes repeats part of the question.

While users could already use tools to vocally communicate with ChatGPT, that feature worked by clubbing together three models: one that turned the user’s voice into text, one that carried out the operations, and one that returned an audio-based result. With GPT-4o, a single neural network takes care of all these layers, so the model is able to respond faster and glean more insights from the user and their surroundings.
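To illustrate the difference, here is a minimal conceptual sketch in Python. The stub functions (`speech_to_text`, `run_language_model`, `text_to_speech`) are hypothetical placeholders standing in for the three separate systems the older voice mode chained together; they are not real OpenAI APIs.

```python
# Illustrative sketch only: these stubs are hypothetical placeholders for
# separate speech-to-text, language-model and text-to-speech systems,
# not real OpenAI APIs.

def speech_to_text(audio_clip: bytes) -> str:
    return "transcribed user request"            # placeholder transcription

def run_language_model(prompt: str) -> str:
    return f"spoken reply to: {prompt}"          # placeholder text response

def text_to_speech(text: str) -> bytes:
    return text.encode()                         # placeholder synthesised audio

def old_voice_mode(audio_clip: bytes) -> bytes:
    """Pre-GPT-4o voice mode: three models chained one after another.
    Each hop adds latency, and cues such as tone of voice or background
    sounds are discarded once the audio is reduced to plain text."""
    return text_to_speech(run_language_model(speech_to_text(audio_clip)))

# GPT-4o replaces this whole chain with a single network trained end-to-end
# on text, vision and audio, which is why it can respond faster and pick up
# on cues the old pipeline never saw.
```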

(For top technology news of the day, subscribe to our tech newsletter Today’s Cache)

What all can GPT-4o do?

OpenAI ran several demos to show off the diverse abilities of GPT-4o across audio, images, and text. The AI interface, based on a user’s instructions, can turn a picture of a man into a caricature, create and manipulate a 3D logo, or attach a logo to an object. It can also generate meeting notes based on an audio recording, design a cartoon character, and even make a stylised movie poster with real people’s photos.

In promotional video clippings, GPT-4o assessed a man’s readiness for an interview and joked about him being dressed too casually, thus demonstrating its visual understanding. In other clips, it helped set up a game, assisted a child in solving a math problem, recognised real-life objects in Spanish, and even expressed sarcasm.

OpenAI did not shy away from praising the new model, claiming that it beat existing rivals such as Claude 3 Opus and Gemini Ultra 1.0, as well as its own GPT-4 offering, in several text and vision understanding evaluations.

What can’t it do?

While GPT-4o can process text, audio, and images, one noticeable omission is video generation – despite the model’s vision understanding capability. So, users cannot ask GPT-4o to give them a fleshed-out movie trailer, but they can ask the model questions about their surroundings by making the AI see the user’s environment through their smartphone’s camera.

Furthermore, GPT-4o made some slip-ups and errors when demonstrating its abilities. For example, when converting two portraits into a crime movie-style poster, the model initially produced gibberish instead of text. Though the results were later refined, the final product also had a slightly raw AI-generated feel.

GPT-4o comes at a crucial time for the ChatGPT-maker, which is now in competition with other Big Tech firms fine-tuning their own models or turning them into business tools.

While companies like Google freely offer chatbots that can access information in real time, OpenAI fell behind by imposing a knowledge cut-off on the most basic, free version of ChatGPT. This meant non-paying users were receiving outdated information from a less developed model compared to users trying out cutting-edge offerings from rivals.

It remains to be seen how far GPT-4o will enhance the ChatGPT experience for non-paying users.

Who can use this AI model?

ChatGPT will immediately be getting GPT-4o’s text and image capabilities, said OpenAI. Significantly, even non-paying users of ChatGPT will be able to experience GPT-4o. ChatGPT Plus users will get increased message limits along with the upgrade, while a new version of Voice Mode is also planned for them.

“GPT-4o is 2x faster, half the price, and has 5x higher rate limits compared to GPT-4 Turbo. We plan to launch support for GPT-4o’s new audio and video capabilities to a small group of trusted partners in the API in the coming weeks,” said OpenAI in its post.
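For developers, GPT-4o’s text and image capabilities are exposed through OpenAI’s existing chat completions endpoint. The sketch below, written against OpenAI’s official Python SDK, shows roughly what a request combining a text prompt with an image might look like; the image URL is a placeholder, and exact parameters should be checked against OpenAI’s documentation.

```python
# Rough sketch of calling GPT-4o with a text + image prompt via OpenAI's
# Python SDK. The image URL is a placeholder; requires OPENAI_API_KEY to be
# set in the environment.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is in this picture."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/photo.jpg"}},  # placeholder image
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The new audio and video capabilities mentioned in the quote are not part of this general release; OpenAI says those will first reach a small group of trusted partners through the API.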

What safeguards are in place for GPT-4o?

As generative AI systems grow more advanced and natural-sounding, with ever faster response times, there are fears they will be misused for purposes such as carrying out scam calls, threatening people, impersonating non-consenting individuals, and creating false but believable news media.

OpenAI said that GPT-4o had been tested but that the company would continue to investigate risks and address them quickly, apart from limiting certain audio features at launch.

“GPT-4o has safety built-in by design across modalities, through techniques such as filtering training data and refining the model’s behaviour through post-training. We have also created new safety systems to provide guardrails on voice outputs,” said OpenAI, adding that over 70 experts across fields such as social psychology, bias/fairness, and misinformation had carried out red-team testing.

What does GPT-4o have to do with the Hollywood film ‘Her’?

When announcing the launch of GPT-4o, OpenAI CEO Sam Altman posted the word “her” on X.

This was taken to be a reference to the 2013 Hollywood sci-fi romance film written and directed by Spike Jonze, in which the protagonist played by Joaquin Phoenix grows infatuated with an AI assistant voiced by Scarlett Johansson.

In most of the demo clips shared by OpenAI, GPT-4o’s “voice” sounded female. Unlike more basic iterations, the voices in OpenAI’s latest model were expressive, friendly, and even affectionate, sounding more like a friend – or someone closer – rather than a machine-generated voice.

The GPT-4o voice reacted in typically human ways, such as cooing at an adorable dog, giving a man fashion advice, and guiding a student working on a math problem.
