OpenAI released its new o1 models on Thursday, giving ChatGPT users their first chance to try AI models that pause to “think” before they answer. There’s been a lot of hype building up to these models, codenamed “Strawberry” inside OpenAI. But does Strawberry live up to the hype?
Sort of.
Compared to GPT-4o, the o1 models feel like one step forward and two steps back. OpenAI o1 excels at reasoning and answering complex questions, but the model is roughly four times more expensive to use than GPT-4o. OpenAI’s latest model lacks the tools, multimodal capabilities, and speed that made GPT-4o so impressive. In fact, OpenAI even admits that “GPT-4o is still the best option for most prompts” on its help page, and notes elsewhere that o1 struggles at simpler tasks.
“It’s impressive, but I think the improvement is not very significant,” said Ravid Shwartz Ziv, an NYU professor who studies AI models. “It’s better at certain problems, but you don’t have this across-the-board improvement.”
For all of these reasons, it’s important to use o1 only for the questions it’s truly designed to help with: big ones. To be clear, most people are not using generative AI to answer these kinds of questions today, largely because today’s AI models are not very good at them. However, o1 is a tentative step in that direction.
Thinking through big ideas
OpenAI o1 is unique because it “thinks” before answering, breaking down big problems into small steps and attempting to identify when it gets one of those steps right or wrong. This “multi-step reasoning” isn’t entirely new (researchers have proposed it for years, and You.com uses it for complex queries), but it hasn’t been practical until recently.
“There’s a lot of excitement in the AI community,” said Workera CEO and Stanford adjunct lecturer Kian Katanforoosh, who teaches classes on machine learning, in an interview. “If you can train a reinforcement learning algorithm paired with some of the language model techniques that OpenAI has, you can technically create step-by-step thinking and allow the AI model to walk backwards from big ideas you’re trying to work through.”
OpenAI o1 is also uniquely pricey. With most models, you pay for input tokens and output tokens. However, o1 adds a hidden process (the small steps the model breaks big problems into), which adds a large amount of compute you never fully see. OpenAI is hiding some details of this process to maintain its competitive advantage. That said, you still get charged for these hidden steps in the form of “reasoning tokens.” This further emphasizes why you need to be careful about using OpenAI o1, so you don’t get charged for a ton of tokens just for asking what the capital of Nevada is.
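To make the billing arithmetic concrete, here is a rough sketch of how a per-request cost might be estimated. Everything specific in it is an assumption for illustration: the per-million-token prices are hypothetical placeholders (chosen only so the reasoning model is four times the price of the standard one, matching the rough ratio mentioned earlier), the token counts are invented, and the sketch assumes reasoning tokens are billed at the same rate as visible output tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in dollars per one million tokens (hypothetical values only)."""
    input_per_million: float
    output_per_million: float


def request_cost(pricing: Pricing, input_tokens: int, output_tokens: int,
                 reasoning_tokens: int = 0) -> float:
    """Estimate the dollar cost of a single request.

    Assumption: hidden reasoning tokens are billed like output tokens,
    even though the user never sees them in the response.
    """
    billed_output = output_tokens + reasoning_tokens
    total = (input_tokens * pricing.input_per_million
             + billed_output * pricing.output_per_million)
    return total / 1_000_000


# Hypothetical price points, not real rate cards.
STANDARD_MODEL = Pricing(input_per_million=5.0, output_per_million=15.0)
REASONING_MODEL = Pricing(input_per_million=20.0, output_per_million=60.0)

# A short factual question: same prompt and same visible answer length,
# but the reasoning model also spends 1,000 hidden reasoning tokens.
standard_cost = request_cost(STANDARD_MODEL, input_tokens=20, output_tokens=30)
reasoning_cost = request_cost(REASONING_MODEL, input_tokens=20, output_tokens=30,
                              reasoning_tokens=1_000)

print(f"standard model:  ${standard_cost:.5f}")
print(f"reasoning model: ${reasoning_cost:.5f}")
print(f"ratio: {reasoning_cost / standard_cost:.0f}x")
```

Under these made-up numbers, the gap on a trivial question is far larger than the headline four-times price difference, because the hidden reasoning tokens dwarf the short visible answer. The real figures will depend on actual prices and on how many reasoning tokens a given prompt triggers.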
The idea of an AI model that helps you “walk backwards from big ideas” is powerful, though. In practice, the model is pretty good at that.
In one example, I asked ChatGPT o1 preview to help my family plan Thanksgiving, a task that could benefit from a little unbiased logic and reasoning. Specifically, I wanted help figuring out if two ovens would be sufficient to cook a Thanksgiving dinner for 11 people and wanted to talk through whether we should consider renting an Airbnb to get access to a third oven.
After 12 seconds of “thinking,” ChatGPT wrote out a 750+ word response, ultimately telling me that two ovens should be sufficient with some careful strategizing, and that sticking with two would allow my family to save on costs and spend more time together. It also broke down its thinking at each step of the way and explained how it considered external factors, including costs, family time, and oven management.
ChatGPT o1 preview told me how to prioritize oven space at the house that is hosting the event, which was smart. Oddly, it suggested I consider renting a portable oven for the day. That said, the model performed much better than GPT-4o, which required multiple follow-up questions about what exact dishes I was bringing, and then gave me bare-bones advice I found less useful.
Asking about Thanksgiving dinner may seem silly, but you could see how this tool would be helpful for breaking down complicated tasks.
I also asked o1 to help me plan out a busy day at work, where I needed to travel between the airport, multiple in-person meetings in various locations, and my office. It gave me a very detailed plan, but it was maybe a little bit much. Sometimes, all the added steps can be a little overwhelming.
For a simpler question, o1 does way too much: it doesn’t know when to stop overthinking. I asked where you can find cedar trees in America, and it delivered an 800+ word response, outlining every variation of cedar tree in the country, including their scientific names. It even had to consult with OpenAI’s policies at some point, for some reason. GPT-4o did a much better job answering this question, giving me about three sentences explaining that you can find the trees all over the country.
Tempering expectations
In some ways, Strawberry was never going to live up to the hype. Reports about OpenAI’s reasoning models date back to November 2023, right around the time everyone was looking for an answer about why OpenAI’s board ousted Sam Altman. That spun up the rumor mill in the AI world, leaving some to speculate that Strawberry was a form of AGI, the enlightened version of AI that OpenAI aspires to ultimately create.
Altman confirmed o1 is not AGI to clear up any doubts, not that you’d be confused after using the thing. The CEO also trimmed expectations around this launch, tweeting that “o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.”
The rest of the AI world is coming to terms with a less exciting launch than expected.
“The hype sort of grew out of OpenAI’s control,” said Rohan Pandey, a research engineer with the AI startup ReWorkd, which builds web scrapers with OpenAI’s models.
He’s hoping that o1’s reasoning ability is good enough to solve a niche set of complicated problems where GPT-4 falls short. That’s likely how most people in the industry are viewing o1: as a useful tool for those problems, but not quite the revolutionary step forward that GPT-4 represented.
“Everybody is waiting for a step function change for capabilities, and it is unclear that this represents that. I think it’s that simple,” said Brightwave CEO Mike Conover, who previously co-created Databricks’ AI model Dolly, in an interview.
What’s the value here?
The underlying principles used to create o1 go back years. Google used similar techniques in 2016 to create AlphaGo, the first AI system to defeat a world champion of the board game Go, points out Andy Harrison, a former Googler and CEO of the venture firm S32. AlphaGo trained by playing against itself countless times, essentially self-teaching until it reached superhuman capability.
He notes that this brings up an age-old debate in the AI world.
“Camp one thinks that you can automate workflows through this agentic process. Camp two thinks that if you had generalized intelligence and reasoning, you wouldn’t need the workflow and, like a human, the AI would just make a judgment,” said Harrison in an interview.
Harrison says he’s in camp one and that camp two requires you to trust AI to make the right decision. He doesn’t think we’re there yet.
However, others think of o1 as less of a decision-maker and more of a tool to question your thinking on big decisions.
Katanforoosh, the Workera CEO, described an example in which he was preparing to interview a data scientist to work at his company. He tells OpenAI o1 that he only has 30 minutes and wants to assess a certain number of skills. He can then work backward with the AI model to understand whether he’s thinking about this correctly, and o1 will account for time constraints and the like.
The question is whether this helpful tool is worth the hefty price tag. While AI models generally continue to get cheaper, o1 is one of the first in a long time that we’ve seen get more expensive.