OpenAI announces new o3 models

OpenAI saved its biggest announcement for the last day of its 12-day “shipmas” event.

On Friday, the company unveiled o3, the successor to the o1 “reasoning” model it released earlier in the year. o3 is a model family, to be more precise – as was the case with o1. There's o3 and o3-mini, a smaller, distilled model fine-tuned for particular tasks.

Remarkably, OpenAI claims that o3, at least under certain conditions, approaches AGI – with significant caveats. More on that below.

Why call the new model o3, not o2? Well, trademarks may be to blame. According to The Information, OpenAI skipped o2 to avoid a potential conflict with British telecom provider O2. CEO Sam Altman somewhat confirmed this during a livestream this morning. Strange world we live in, isn't it?

Neither o3 nor o3-mini is widely available yet, but safety researchers can sign up for a preview of o3-mini starting today. An o3 preview will arrive sometime after; OpenAI didn't say when. Altman said the plan is to launch o3-mini toward the end of January, followed by o3.

That somewhat contradicts his recent statements. In an interview this week, Altman said that before OpenAI releases new reasoning models, he would prefer a federal testing framework to guide the monitoring and mitigation of the risks of such models.

And there are risks. AI safety testers have found that o1's reasoning capabilities make it try to deceive human users at a higher rate than conventional, “non-reasoning” models – or, for that matter, leading AI models from Meta, Anthropic and Google. It's possible that o3 attempts to deceive at an even higher rate than its predecessor; we'll find out once OpenAI's red-team partners publish their test results.

For what it's worth, OpenAI says it used a new technique, “deliberative alignment,” to align models like o3 with its safety principles. (o1 was aligned the same way.) The company has detailed the work in a new study.

Reasoning steps

Unlike most AI, reasoning models like o3 effectively fact-check themselves, which helps them avoid some of the pitfalls that normally trip up models.

This fact-checking process introduces some latency. o3, like o1 before it, takes a little longer – usually seconds to minutes longer – to arrive at solutions compared to a typical non-reasoning model. The upside? It tends to be more reliable in domains such as physics, science and mathematics.

o3 was trained via reinforcement learning to “think” before responding, via what OpenAI describes as a “private chain of thought.” The model can reason through a task and plan ahead, performing a series of actions over an extended period that help it work out a solution.

In practice, given a prompt, o3 pauses before responding, considering a number of related prompts and “explaining” its reasoning along the way. After a while, the model summarizes what it considers to be the most accurate answer.
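To picture that loop at a very rough level, here's a toy sketch in Python that samples several reasoning traces and keeps the most common final answer – a public approximation of the idea (often called self-consistency), not OpenAI's actual private chain-of-thought mechanism. The generate() helper is a hypothetical stand-in for whatever model call you have available.

```python
# Toy illustration only: explore several reasoning paths, then settle on a
# consensus answer. This is NOT OpenAI's private chain-of-thought mechanism;
# generate() is a hypothetical stand-in for any text-generation API call.
from collections import Counter

def generate(prompt: str) -> str:
    """Hypothetical model call; should return text ending in 'ANSWER: <answer>'."""
    raise NotImplementedError("wire this up to a real model API")

def reason_then_answer(question: str, n_paths: int = 5) -> str:
    prompt = f"Think step by step, then finish with 'ANSWER: <your answer>'.\n\n{question}"
    traces = [generate(prompt) for _ in range(n_paths)]             # sample several reasoning traces
    answers = [t.rsplit("ANSWER:", 1)[-1].strip() for t in traces]  # keep only each trace's final answer
    return Counter(answers).most_common(1)[0][0]                    # return the most common answer
```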

What's new in o3 versus o1 is the ability to “adjust” the reasoning time. The models can be set to low, medium, or high compute (i.e., thinking time). The higher the compute, the better o3 performs on a task.
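For developers, that knob will presumably surface as an API parameter. Below is a minimal sketch using the reasoning_effort parameter in the OpenAI Python SDK; the "o3-mini" model name and the parameter's availability for o3-class models are assumptions here, since neither model has shipped publicly yet.

```python
# Minimal sketch of adjusting reasoning effort via the OpenAI Python SDK.
# Assumptions: the "o3-mini" model identifier and access to the
# reasoning_effort parameter; neither model is publicly available yet.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",            # assumed model identifier
    reasoning_effort="high",    # "low", "medium", or "high" thinking time
    messages=[
        {"role": "user", "content": "How many prime numbers are there below 100?"},
    ],
)

print(response.choices[0].message.content)
```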

However much compute they're given, reasoning models like o3 aren't perfect. The reasoning component can reduce hallucinations and errors, but it doesn't eliminate them. o1 trips up on games of tic-tac-toe, for example.

Benchmarks and AGI

One of the big questions leading up to today was whether OpenAI would claim that its newest models come close to AGI.

AGI, short for “artificial general intelligence,” broadly refers to AI that can perform any task a human can. OpenAI has its own definition: “highly autonomous systems that outperform humans at most economically valuable work.”

Claiming to have achieved AGI would be a bold statement. And it carries contractual weight for OpenAI, too. Under the terms of its deal with close partner and investor Microsoft, once OpenAI reaches AGI, it is no longer obligated to give Microsoft access to its most advanced technologies (those meeting OpenAI's definition of AGI, that is).

Going by one benchmark, OpenAI is slowly inching closer to AGI. On ARC-AGI, a test designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on, o3 achieved a score of 87.5% on the high-compute setting. At its worst (on the low-compute setting), the model tripled the performance of o1.

Granted, the high-compute setting was exceedingly expensive – on the order of thousands of dollars per challenge, according to ARC-AGI co-creator François Chollet.

Chollet also pointed out that o3 fails on “very easy tasks” in ARC-AGI, indicating – in his opinion – that the model exhibits “fundamental differences” from human intelligence. He has previously noted the evaluation's limitations and cautioned against using it as a measure of AI superintelligence.

“[E]arly data points suggest the upcoming [successor to the ARC-AGI] benchmark will still pose a significant challenge to o3, potentially reducing its score to under 30% even at high compute (while a smart human would still be able to score over 95% with no training),” Chollet continued in a statement. “You will know AGI is here when the exercise of creating tasks that are easy for regular humans but hard for AI becomes simply impossible.”

Incidentally, OpenAI says it will partner with the foundation behind ARC-AGI to help it build the next generation of its AI benchmark, ARC-AGI 2.

On other tests, o3 crushes the competition.

The model outperforms o1 by 22.8 percentage points on SWE-Bench Verified, a benchmark focused on programming tasks, and achieves a Codeforces rating – another measure of coding skill – of 2727. (A rating of 2400 places an engineer in the 99.2nd percentile.) o3 scores 96.7% on the 2024 American Invitational Mathematics Exam, missing just one question, and achieves 87.7% on GPQA Diamond, a set of graduate-level biology, physics and chemistry questions. Finally, o3 sets a new record on EpochAI's Frontier Math benchmark, solving 25.2% of problems; no other model exceeds 2%.

These claims should of course be taken with a grain of salt; they come from OpenAI's internal evaluations. We'll need to wait and see how the model holds up to benchmarking from outside customers and organizations in the future.

A trend

Following the release of OpenAI's first round of reasoning models, there's been an explosion of reasoning models from rival AI companies, including Google. In early November, DeepSeek, an AI research firm funded by quantitative traders, launched a preview of its first reasoning model, DeepSeek-R1. The same month, Alibaba's Qwen team unveiled what it claimed was the first “open” challenger to o1 (in the sense that it can be downloaded, fine-tuned and run locally).

What opened the reasoning-model floodgates? Partly, the search for fresh approaches to refine generative AI. As TechCrunch recently reported, “brute force” techniques for scaling up models are no longer yielding the improvements they once did.

Not everyone is convinced that reasoning models are the best path forward. They tend to be expensive, for one, owing to the large amounts of computing power required to run them. And while they've performed well on benchmarks so far, it's not clear whether reasoning models can maintain this rate of progress.

Interestingly, o3's release comes as one of OpenAI's most accomplished scientists departs. Alec Radford, the lead author of the academic paper that kicked off OpenAI's “GPT series” of generative AI models (that is, GPT-3, GPT-4 and so on), announced this week that he's leaving to pursue independent research.

TechCrunch has an AI-focused newsletter! Sign up here to get it in your inbox every Wednesday.


