TAI 131: OpenAI’s o3 Passes Human Experts; LLMs Accelerating With Inference Compute Scaling
Author(s): Towards AI Editorial Team
Originally published on Towards AI.
What happened this week in AI by Louie
OpenAI wrapped up its “12 Days of OpenAI” campaign and saved the best till last with the reveal of its o3 and o3-mini reasoning models. These models are successors to the o1 series and are arguably the largest step-change improvement yet in LLM capabilities on complex tasks, for the first time eclipsing human experts in many domains. The o3 release drowned out the otherwise significant launch of Google’s Gemini 2.0 Flash Thinking Mode model, its first reasoning model (in the style of o1/o3), which, unlike OpenAI’s models, doesn’t hide its thinking tokens.
There is a huge amount to unpack in the o3 release. The model sailed past human expert scores on many key advanced benchmarks, including coding, mathematics, and PhD-level science. Perhaps most noteworthy was the breakthrough on the ARC-AGI benchmark, where LLMs have traditionally failed and achieved only average scores even with heavy scaffolding and brute force. For example, o3 (low efficiency) achieved 87.5%, vs. 32% for o1 just a week earlier and 5% for GPT-4o in May. This score is considered human-level, further fueling debates over whether o3 edges closer to Artificial General Intelligence (AGI). Some of the best scores do come at a huge cost, however: o3 in low-efficiency mode (1,024 samples) costs around $3,400 per task, about 170x the ~$20 for o3 in high-efficiency mode (6 samples, 75.7%) and far above the ~$3 for o1.
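As a quick check on the ARC-AGI cost figures above, the short Python sketch below recomputes the cost ratio and the implied cost per sample. The dollar amounts and sample counts are the approximate ones quoted in the text; the variable names are ours, and the per-sample figure assumes cost scales linearly with the number of samples, which OpenAI has not confirmed.

```python
# Approximate per-task ARC-AGI figures quoted in the text above.
LOW_EFF_COST_USD = 3400.0   # o3 low efficiency
LOW_EFF_SAMPLES = 1024
HIGH_EFF_COST_USD = 20.0    # o3 high efficiency
HIGH_EFF_SAMPLES = 6

# How much more expensive is the low-efficiency run per task?
cost_ratio = LOW_EFF_COST_USD / HIGH_EFF_COST_USD

# How many more samples does it draw?
sample_ratio = LOW_EFF_SAMPLES / HIGH_EFF_SAMPLES

# Implied cost per sample (assumes cost is linear in sample count).
low_eff_per_sample = LOW_EFF_COST_USD / LOW_EFF_SAMPLES
high_eff_per_sample = HIGH_EFF_COST_USD / HIGH_EFF_SAMPLES

if __name__ == "__main__":
    print(f"cost ratio:   {cost_ratio:.0f}x")
    print(f"sample ratio: {sample_ratio:.1f}x")
    print(f"per sample:   ${low_eff_per_sample:.2f} vs ${high_eff_per_sample:.2f}")
```

Under these figures, the cost ratio (170x) closely tracks the sample ratio (about 171x), so the implied cost per sample is roughly $3.3 in both modes. That would suggest the extra spend is almost entirely down to drawing more samples rather than longer individual samples.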
On the GPQA Diamond test, designed for PhD-level science questions, o3 scored 87.7%, compared to the 78% achieved by o1. For context, PhD holders with internet access typically score between 34% (outside their specialty) and 81% (within their domain). In coding, o3’s Elo rating of 2727 on Codeforces puts it in the 99.95th percentile of competitive programmers, far exceeding the reach of most human professionals. Mathematics is another area where o3 shines, achieving 96.7% accuracy on the American Invitational Mathematics Examination (AIME), up from o1’s 83.3% and just 13.4% for GPT-4o only months earlier.
This release came not only with a cost escalation of roughly 1,000x for some tasks but also with the promise of huge cost savings! Thanks to success with model distillation and other techniques, o3-mini outperforms the much larger o1 model, released just last week, on many coding and maths tasks. For example, o3-mini with medium compute achieved a much stronger Codeforces Elo of 1997 vs. 1891 for o1, but at what we eyeball as a ~70–80% lower total cost.
How do the models work? OpenAI still hasn’t disclosed much beyond the fact that it uses reinforcement learning to improve the model’s reasoning during training. However, employees have posted that the models are still just LLMs and use autoregression. We think the model is trained to be highly efficient at chain-of-thought reasoning: exploring the most likely paths and realizing when it has made a mistake. We think the rapid progress in just 3 months between o1 and o3 likely comes primarily from adding synthetic data from o1’s full chain-of-thought thinking tokens to the reinforcement learning dataset used for training. In contrast, we expect the initial o1 mostly used a smaller set of reasoning examples commissioned from human experts (such examples are missing from pre-training data because people almost never type out their full internal monologue and reasoning process and instead skip to the answers!). It is also possible that o3 was built on a different, more advanced base foundation model (o1 likely used 4o), perhaps GPT-4.5 or a checkpoint of the rumored Orion or GPT-5 model, leading to additional benefits.
One interesting note on the new regime of “inference time” compute scaling is that OpenAI appears to be scaling thinking tokens both in series (up to ~100k reasoning tokens in its context window) and in parallel, with 6 (high efficiency) or 1,024 (low efficiency) samples used in the ARC-AGI evaluation. It is unclear how the best answer is chosen from these. It could be simple majority voting, but more likely there is complexity and extra secret sauce in how the best samples are automatically and rapidly searched, evaluated, and chosen. We think it is possible that some form of this parallel scaling is also taking place in the o1-Pro model (available within the $200/month ChatGPT Pro plan).
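To make the parallel-sampling idea concrete, here is a minimal Python sketch of the simplest aggregation strategy mentioned above: plain majority voting over independent samples. It is not OpenAI’s method, which is undisclosed. The `sample_answer` function is a hypothetical stand-in for a model call, and its 60% per-sample accuracy is an arbitrary assumption chosen for illustration.

```python
import random
from collections import Counter


def sample_answer(rng, correct="42", wrong=("41", "43", "44"), p_correct=0.6):
    """Hypothetical stand-in for one sampled model response (not a real API).

    Returns the correct answer with probability p_correct, otherwise one of
    several wrong answers chosen uniformly at random.
    """
    if rng.random() < p_correct:
        return correct
    return rng.choice(wrong)


def majority_vote(answers):
    """Return the most common answer and the share of samples that agree."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def solve_with_parallel_samples(n_samples, seed=0):
    """Draw n_samples independent answers and aggregate them by voting."""
    rng = random.Random(seed)
    answers = [sample_answer(rng) for _ in range(n_samples)]
    return majority_vote(answers)


if __name__ == "__main__":
    for n in (6, 1024):  # sample counts quoted for the ARC-AGI runs
        answer, agreement = solve_with_parallel_samples(n)
        print(f"{n} samples -> answer {answer!r}, agreement {agreement:.2f}")
```

Voting only helps when the correct answer is the single most likely output and answers can be compared exactly. Open-ended tasks would need a learned verifier or ranking step instead, which may be part of the “secret sauce” speculated about above.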
[Figure: OpenAI models’ rapid breakthroughs on complex benchmarks this year.]
The models have not yet been released, and the rollout schedule is still dependent on safety testing. o3-mini is slated for release in late January 2025, with o3 following shortly after. Researchers can apply for early access to test the models, with an application deadline of January 10th, 2025. Pricing has also yet to be announced.
Why should you care?
So what does this all mean? LLMs can now perform to human expert standards at many tasks, and these breakthroughs were achieved at an accelerating pace. Will the inference-time compute scaling paradigm continue to deliver new generations every 3 months, compared with every 1–2 years under the training-time scaling regime? How will these models perform in the real world beyond their benchmarks? Will o3 models rapidly begin to transform the global economy and disrupt huge numbers of jobs, or is the cost too large a bottleneck to adoption? On which tasks will it be worth spending 170x more compute for incrementally better performance (as with ARC-AGI)? Is this model AGI already? Do you need to find a new career?
While we don’t think this model is AGI yet (a term with wildly differing definitions in any case), we think this model is hugely significant and should be on the front page of all newspapers. It suggests that deep learning and the LLM paradigm don’t have any obvious limits. Far from the slowdown and failures of new model generations covered in the media, progress on the most complex benchmarks is faster than it has ever been. My key takeaway is that if we can develop a benchmark, or generate anywhere from a few to a few hundred detailed reasoning examples, for a category of human work, we can solve that category with the help of extra synthetic reasoning data. (This doesn’t yet apply to physical labor, but AI-based robotics is also progressing rapidly!) The price of o3 will be a large barrier initially, but we expect large improvements in the cost and particularly the efficiency of running parallel “samples.” The o3-mini also appears to be a game changer; however, its huge cost savings will likely come at the price of narrower capabilities.
To achieve products with high enough reliability and affordability for mass adoption, we still think a large amount of work will be needed from LLM Developers to optimize and customize these models for specific industries and niche tasks, including gathering industry-specific data, creating reasoning data, and creating your own evaluations. With Google Gemini also joining the reasoning model race this week, and with open-source reasoning models from Alibaba Qwen and DeepSeek in China, we expect competition to drive affordability and developer customization options for these models. OpenAI has already announced it will release reinforcement learning-based reasoning fine-tuning options, and we think there will eventually also be reasoning model distillation options to customize larger models into smaller forms. So there is no better time to become an LLM Developer with our own 80+ lesson Python course and learn to harness these models!
— Louie Peters — Towards AI Co-founder and CEO
Hottest News
1. OpenAI Announces OpenAI o3
OpenAI announced OpenAI o3, the latest model in its o-Model Reasoning Series. Building on its predecessors, o3 showcases huge leaps in mathematical and scientific reasoning, prompting discussions about its capabilities and constraints.
2. xAI Raises $6 Billion in Series C Funding
Elon Musk’s xAI announced it raised $6 billion in a Series C funding round, bringing its value to more than $40 billion. The company said the funding would be allocated to products and infrastructure, including its Grok AI model and the multibillion-dollar supercomputer site used to train its AI models. The Colossus supercomputer scaled to 100,000 NVIDIA Hopper GPUs in record time and plans to soon add another 100k.
3. OpenAI Is Offering 1 Million Free Tokens for GPT-4o and o1
A user on X highlighted that OpenAI seems to be offering 1 million free tokens for GPT-4o and o1 if you share your API usage with them for training. Users can get up to 10 million tokens per day on traffic shared with OpenAI on smaller models. This is similar to Google Gemini’s free tier strategy for its API, where data can be used for training. We think the race for user data has become even more critical given the success of reasoning models where OpenAI could use thinking tokens from user o1 model prompts to expand its reinforcement learning data sets.
4. Google Releases Its Own ‘Reasoning’ AI Model
Google has released Gemini 2.0 Flash Thinking Mode, an experimental model trained to generate the “thinking process” the model goes through as part of its response. Thinking models are available in Google AI Studio and through the Gemini API.
5. Microsoft AI Research Open-Sources PromptWizard
Researchers from Microsoft Research India have developed and open-sourced PromptWizard, an innovative AI framework for optimizing prompts in black-box LLMs. This framework employs a feedback-driven critique-and-synthesis mechanism to iteratively refine prompt instructions and in-context examples, enhancing task performance. PromptWizard operates through two primary phases: a generation phase and a test-time inference phase.
6. The Technology Innovation Institute in Abu Dhabi Released the Falcon 3 Family of Models
The UAE government-backed Technology Innovation Institute (TII) has announced the launch of Falcon 3, a family of open-source small language models (SLMs) designed to run efficiently on lightweight, single GPU-based infrastructures. Falcon 3 features four model sizes — 1B, 3B, 7B, and 10B — with base and instruction variants. According to the Hugging Face leaderboard, the models are already outperforming or closely matching popular open-source counterparts in their size class, including Meta’s Llama and category leader Qwen-2.5.
7. Salesforce Drops Agentforce 2.0
Salesforce announced Agentforce 2.0: the newest version of Agentforce, the first digital labor platform for enterprises. This release introduces a new library of pre-built skills and workflow integrations for rapid customization, the ability to deploy Agentforce in Slack, and advancements in agentic reasoning and retrieval-augmented generation (RAG).
8. Patronus AI Open Sources Glider: A 3B State-of-the-Art Small Language Model (SLM) Judge
Patronus AI has introduced Glider, a general-purpose 3.8B evaluation model. This open-source evaluator model provides quantitative and qualitative feedback for text inputs and outputs. It acts as a fast, inference-time guardrail for LLM systems, offering detailed reasoning chains and highlighting key phrases to enhance interpretability. Glider is built upon the Phi-3.5-mini-instruct base model and has been fine-tuned on diverse datasets spanning 685 domains and 183 evaluation criteria.
Five 5-minute reads/videos to keep you learning
1. Alignment Faking in Large Language Models
Alignment faking is where someone appears to share our views or values but is, in fact, only pretending to do so. A new paper from Anthropic’s Alignment Science team, in collaboration with Redwood Research, provides the first empirical example of a large language model engaging in alignment faking without having been explicitly trained or instructed to do so.
2. AI Safety on a Budget: Your Guide to Free, Open-Source Tools for Implementing Safer LLMs
This blog post rounds up free AI safety tools, from guardrails that steer chatbots away from disaster to datasets that help identify toxic content. It also provides insights into the AI safety landscape and how to navigate it, especially on a budget.
This video explains why and when you should fine-tune your LLM in a RAG system. This concept is useful for today’s AI engineers playing with LLMs.
4. The Real Reason Your Company’s AI Isn’t Working (Hint: It’s Not the Technology)
The underlying reason many companies struggle to make AI tools work is not the technology itself. The real challenge lies in organizational structures, cultural resistance, a lack of proper training, and insufficient time allocated for exploration. This article presents some thoughts on addressing these issues, such as investing in leadership support, encouraging cultural change, offering tailored training sessions, and fostering an environment of experimentation.
5. Introducing ReACT LLM Agents: A Secret to More Capable AI
A ReACT agent is a special type of AI agent that uses both Reasoning and Acting to solve the tasks or problems we assign. This article explores this concept, presents use case examples, and explains how it has the potential to make AI more capable.
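The reason-then-act loop behind a ReACT agent can be sketched in a few lines of Python. This is a toy illustration, not code from the linked article. `scripted_model` is a hypothetical stand-in for an LLM call, `lookup` is a hypothetical tool, and the stored facts are the GPQA Diamond scores quoted earlier in this newsletter.

```python
# Toy facts taken from the newsletter text above; a real agent would call
# external tools (search, code execution, APIs) instead.
FACTS = {"o3 GPQA Diamond": "87.7", "o1 GPQA Diamond": "78"}


def lookup(key):
    """Hypothetical tool: return a stored fact, or 'unknown'."""
    return FACTS.get(key, "unknown")


def scripted_model(transcript):
    """Hypothetical stand-in for an LLM call.

    Reads the transcript so far and returns (thought, (action, argument)).
    """
    prefix = "Observation: "
    seen = [line[len(prefix):] for line in transcript if line.startswith(prefix)]
    if len(seen) == 0:
        return "Thought: I need o3's score first.", ("lookup", "o3 GPQA Diamond")
    if len(seen) == 1:
        return "Thought: Now I need o1's score.", ("lookup", "o1 GPQA Diamond")
    gap = float(seen[0]) - float(seen[1])
    return "Thought: I can compute the gap.", ("finish", f"{gap:.1f}")


def react_loop(question, model, tools, max_steps=5):
    """Alternate reasoning (thoughts) and acting (tool calls) until done."""
    transcript = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, (action, argument) = model(transcript)
        transcript.append(thought)
        if action == "finish":
            return argument, transcript
        observation = tools[action](argument)
        transcript.append(f"Action: {action}[{argument}]")
        transcript.append(f"Observation: {observation}")
    return None, transcript  # step budget exhausted without an answer


answer, transcript = react_loop(
    "How many points higher did o3 score than o1 on GPQA Diamond?",
    scripted_model,
    {"lookup": lookup},
)
```

The key design point is that each observation is appended to the transcript before the next model call, so later reasoning steps can build on what earlier actions returned. The `max_steps` budget guards against a model that never decides to finish.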
Repositories & Tools
- Anthropic Cookbook provides code and guides designed to help developers build with Claude.
- Genesis is a physics platform for general-purpose robotics/embodied AI/physical AI applications.
- Picotron is a minimalist repository for pre-training Llama-like models with 4D Parallelism.
- Helicone is an open-source LLM observability platform.
Top Papers of The Week
1. Qwen2.5 Technical Report
This report introduces Qwen2.5, a comprehensive series of LLMs designed to meet diverse needs. Compared to previous iterations, Qwen2.5 has significantly improved during both the pre-training and post-training stages. The pre-training dataset has been scaled from the previous 7 trillion tokens to 18 trillion tokens, and the post-training implements intricate supervised finetuning with over 1 million samples and multistage reinforcement learning.
2. Byte Latent Transformer: Patches Scale Better Than Tokens
This paper introduces the Byte Latent Transformer (BLT), a new byte-level LLM architecture that matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it.
3. Deliberative Alignment: Reasoning Enables Safer Language Models
This paper introduces deliberative alignment, a training paradigm that directly teaches reasoning LLMs the text of human-written and interpretable safety specifications. It trains them to reason explicitly about these specifications before answering. OpenAI used deliberative alignment to align its o-series models, enabling them to use chain-of-thought (CoT) reasoning to reflect on user prompts, identify relevant text from OpenAI’s internal policies, and draft safer responses.
4. Fully Open Source Moxin-7B Technical Report
This paper introduces Moxin 7B, a fully open-source LLM developed in accordance with the Model Openness Framework (MOF). The MOF is a ranked classification system that evaluates AI models based on model completeness and openness, adhering to the principles of open science, open source, open data, and open access. Experiments show that the model performs better in zero-shot evaluation than popular 7B models.
5. RAGBench: Explainable Benchmark for Retrieval-Augmented Generation Systems
This paper introduces RAGBench, a comprehensive, large-scale RAG benchmark dataset of 100k examples. It covers five unique industry-specific domains and various RAG task types. RAGBench examples are sourced from industry corpora, such as user manuals, making it particularly relevant for industry applications.
6. CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
This paper presents CosyVoice 2, an improved version of the CosyVoice streaming speech synthesis model that incorporates comprehensive and systematic optimizations. It introduces finite-scalar quantization to improve the codebook utilization of speech tokens and streamlines the model architecture to allow direct use of a pre-trained LLM. It also uses a chunk-aware causal flow matching model to support various synthesis scenarios.
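Returning to the Byte Latent Transformer entry above, the idea of segmenting patches by next-byte entropy can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the paper’s implementation: a bigram count table fitted on the input stands in for the small byte-level language model that would estimate entropy, and the threshold value is arbitrary. All function names are ours.

```python
import math
from collections import Counter, defaultdict


def fit_bigram_counts(data: bytes):
    """Count which byte follows which (toy stand-in for a small byte LM)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        counts[prev][nxt] += 1
    return counts


def next_byte_entropy(counts, prev: int) -> float:
    """Shannon entropy (in bits) of the next-byte distribution after prev."""
    dist = counts.get(prev)
    if not dist:
        return 0.0  # unseen context: treat as fully predictable in this toy
    total = sum(dist.values())
    return -sum((c / total) * math.log2(c / total) for c in dist.values())


def entropy_patches(data: bytes, counts, threshold: float):
    """Start a new patch wherever the upcoming byte is hard to predict."""
    if not data:
        return []
    patches, start = [], 0
    for i in range(1, len(data)):
        if next_byte_entropy(counts, data[i - 1]) > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


if __name__ == "__main__":
    text = b"the cat sat on the mat. the cat sat on the hat."
    table = fit_bigram_counts(text)
    for patch in entropy_patches(text, table, threshold=1.0):
        print(patch)
```

Predictable stretches end up in long patches and uncertain positions trigger boundaries, so a model that spends a fixed amount of compute per patch would spend more per byte where the data is harder to predict.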
Quick Links
1. OpenAI brings ChatGPT to your landline. Call 1-800-242-8478, and OpenAI’s AI-powered assistant will respond as of Wednesday afternoon. The experience is more or less identical to Advanced Voice Mode. ChatGPT responds to the questions users ask over the phone and can handle tasks such as translating a sentence into a different language.
2. Google is expanding Gemini’s latest in-depth research mode to 40 more languages. The company launched the in-depth research mode earlier this month, allowing Google One AI premium plan users to unlock an AI-powered research assistant.
3. GitHub has launched GitHub Copilot Free, an accessible version of its popular AI-powered coding assistant — with limits. The new free tier for VS Code aims to expand the AI-powered code completion assistant’s reach to a broader audience of developers — namely, those with only light usage needs and tighter budgets.
Who’s Hiring in AI
Applied AI Finetuning Engineer @Anthropic (Multiple US locations)
Generative AI for Test Case Generation — Master Thesis Opportunity @IBM (Frankfurt/Germany)
Generative AI Engineer @CAI (Remote)
AI Strategist @Navy Federal Credit Union (Multiple US locations)
New College Grad, Hardware Integration Engineer @Western Digital (San Jose, CA, USA)
Software Development Engineer @Siemens Digital Industries Software (New Cairo, Al Qahirah, Egypt)
Interested in sharing a job opportunity here? Contact [email protected].
Published via Towards AI