OpenAI unveils o1, a model that can fact-check itself

September 13, 2024

ChatGPT maker OpenAI has announced its next major product release: A generative AI model code-named Strawberry, officially called OpenAI o1.

To be more precise, o1 is actually a family of models. Two are available Thursday in ChatGPT and via OpenAI’s API: o1-preview and o1-mini, a smaller, more efficient model aimed at code generation.

You’ll have to be subscribed to ChatGPT Plus or Team to see o1 in the ChatGPT client. Enterprise and educational users will get access early next week.

Note that the o1 chatbot experience is fairly bare-bones at present. Unlike GPT-4o, o1’s forebear, o1 can’t browse the web or analyze files yet. The model does have image-analyzing features, but they’ve been disabled pending additional testing. And o1 is rate-limited; weekly limits are currently 30 messages for o1-preview and 50 for o1-mini.

Another downside: o1 is expensive. Very expensive. In the API, o1-preview costs $15 per 1 million input tokens and $60 per 1 million output tokens. That’s three times what GPT-4o costs for input and four times what it costs for output. (Tokens are bits of raw data; 1 million tokens is equivalent to around 750,000 words.)
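To make those prices concrete, here is a minimal sketch of the arithmetic for a single API request. The o1-preview prices are the ones quoted above; the GPT-4o prices are not stated directly and are inferred from the 3x and 4x ratios, so treat them as an assumption. The function name and the token counts are hypothetical, chosen only for illustration.

```python
# Prices in US dollars per 1 million tokens.
# o1-preview figures are the ones quoted in the article.
O1_PREVIEW = {"input": 15.00, "output": 60.00}

# Assumption: GPT-4o prices inferred from the article's "3x" and "4x" ratios.
GPT_4O = {"input": 15.00 / 3, "output": 60.00 / 4}


def request_cost(prices, input_tokens, output_tokens):
    """Return the dollar cost of one request at the given per-million-token prices."""
    total = input_tokens * prices["input"] + output_tokens * prices["output"]
    return total / 1_000_000


# Hypothetical request: 10,000 tokens in, 2,000 tokens out.
o1_cost = request_cost(O1_PREVIEW, 10_000, 2_000)
gpt4o_cost = request_cost(GPT_4O, 10_000, 2_000)
print(f"o1-preview: ${o1_cost:.2f}, GPT-4o: ${gpt4o_cost:.2f}")
```

Because output tokens carry the larger multiplier, the gap between the two models widens as responses get longer relative to prompts.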

OpenAI says it plans to bring o1-mini access to all free users of ChatGPT but hasn’t set a release date. We’ll hold the company to it.

Chain of reasoning

OpenAI o1 avoids some of the reasoning pitfalls that normally trip up generative AI models because it can effectively fact-check itself by spending more time considering all parts of a question. What makes o1 “feel” qualitatively different from other generative AI models is its ability to “think” before responding to queries, according to OpenAI.

When given additional time to “think,” o1 can reason through a task holistically — planning ahead and performing a series of actions over an extended period of time that help the model arrive at an answer. This makes o1 well-suited for tasks that require synthesizing the results of multiple subtasks, like detecting privileged emails in an attorney’s inbox or brainstorming a product marketing strategy.

In a series of posts on X on Thursday, Noam Brown, a research scientist at OpenAI, said that “o1 is trained with reinforcement learning.” This teaches the system “to ‘think’ before responding via a private chain of thought” through rewards when o1 gets answers right and penalties when it does not, he said.

Brown added that OpenAI used a new optimization algorithm and training dataset containing “reasoning data” and scientific literature specifically tailored for reasoning tasks. “The longer [o1] thinks, the better it does,” he said.

TechCrunch wasn’t offered the opportunity to test o1 before its debut; we’ll get our hands on it as soon as possible. But according to a person who did have access — Pablo Arredondo, VP at Thomson Reuters — o1 is better than OpenAI’s previous models (e.g., GPT-4o) at things like analyzing legal briefs and identifying solutions to problems in LSAT logic games.

“We saw it tackling more substantive, multi-faceted, analysis,” Arredondo told TechCrunch. “Our automated testing also showed gains against a wide range of simple tasks.”

In a qualifying exam for the International Mathematical Olympiad (IMO), a high school math competition, o1 correctly solved 83% of problems while GPT-4o only solved 13%, according to OpenAI. (That’s less impressive when you consider that Google DeepMind’s recent AI achieved a silver medal in an equivalent to the actual IMO contest.) OpenAI also says that o1 reached the 89th percentile of participants — better than DeepMind’s flagship system AlphaCode 2, for what it’s worth — in the online programming challenge rounds known as Codeforces.

In general, o1 should perform better on problems in data analysis, science, and coding, OpenAI says. (GitHub, which tested o1 with its AI coding assistant GitHub Copilot, reports that the model is adept at optimizing algorithms and app code.) And, at least per OpenAI’s benchmarking, o1 improves over GPT-4o in its multilingual skills, especially in languages like Arabic and Korean.

Ethan Mollick, a professor of management at Wharton, shared his impressions of o1 in a post on his personal blog after using the model for a month. On a challenging crossword puzzle, o1 did well, he said, getting all the answers correct (despite hallucinating a new clue).

OpenAI o1 is not perfect

Now, there are drawbacks.

OpenAI o1 can be slower than other models, depending on the query. Arredondo says o1 can take over 10 seconds to answer some questions; it shows its progress by displaying a label for the current subtask it’s performing.

Given the unpredictable nature of generative AI models, o1 likely has other flaws and limitations. Brown admitted that o1 trips up on games of tic-tac-toe from time to time, for example. And in a technical paper, OpenAI said that it’s heard anecdotal feedback from testers that o1 tends to hallucinate (i.e., confidently make stuff up) more than GPT-4o — and less often admits when it doesn’t have the answer to a question.

“Errors and hallucinations still happen [with o1],” Mollick writes in his post. “It still isn’t flawless.”

We’ll no doubt learn more about these issues in time, particularly once we have a chance to put o1 through the wringer ourselves.

Fierce competition

We’d be remiss if we didn’t point out that OpenAI is far from the only AI vendor investigating these types of reasoning methods to improve model factuality.

Google DeepMind researchers recently published a study showing that by essentially giving models more compute time and guidance to fulfill requests as they’re made, the performance of those models can be significantly improved without any additional tweaks.
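The general idea of trading extra inference-time compute for accuracy can be illustrated with a toy simulation. To be clear, this is not the method from the DeepMind study, nor how o1 works internally; it swaps in a simple, well-known technique (sampling several answers and taking a majority vote) as a stand-in. The "model" below is a hypothetical random guesser, and every name and number is an assumption made for illustration.

```python
import random
from collections import Counter

CORRECT_ANSWER = 42  # hypothetical ground truth for the toy task


def noisy_solver(rng, accuracy=0.4):
    """Hypothetical stand-in for a model: right 40% of the time,
    otherwise a random wrong guess spread over many values."""
    if rng.random() < accuracy:
        return CORRECT_ANSWER
    wrong_answers = [n for n in range(30, 55) if n != CORRECT_ANSWER]
    return rng.choice(wrong_answers)


def majority_vote(rng, num_samples):
    """Spend more compute: sample several answers, return the most common one."""
    answers = [noisy_solver(rng) for _ in range(num_samples)]
    return Counter(answers).most_common(1)[0][0]


def accuracy_at(num_samples, trials=2000, seed=0):
    """Fraction of trials in which the voted answer is correct."""
    rng = random.Random(seed)
    hits = sum(majority_vote(rng, num_samples) == CORRECT_ANSWER
               for _ in range(trials))
    return hits / trials


for n in (1, 5, 25):
    print(f"{n:>2} samples per question: accuracy {accuracy_at(n):.2f}")
```

In this toy setup, accuracy climbs as the number of samples per question grows, because wrong guesses scatter across many values while correct ones pile up on a single answer. Real models are far messier, and their errors are often correlated, so the gains in practice depend heavily on the task.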

Illustrating the fierceness of the competition, OpenAI said that it decided against showing o1’s raw “chains of thoughts” in ChatGPT partly due to “competitive advantage.” (Instead, the company opted to show “model-generated summaries” of the chains.)

OpenAI might be first out of the gate with o1. But assuming rivals soon follow suit with similar models, the company’s real test will be making o1 widely available, and at a lower price.

From there, we’ll see how quickly OpenAI can deliver upgraded versions of o1. The company says it aims to experiment with o1 models that reason for hours, days, or even weeks to further boost their reasoning capabilities.
