TAI #142: GPT-4.5 Released — But Can It Stack Up Against Reasoning Models?
Last Updated on March 4, 2025 by Editorial Team
Author(s): Towards AI Editorial Team
Originally published on Towards AI.
What happened this week in AI by Louie
This week’s GPT-4.5 release landed with predictable excitement but, perhaps more tellingly, also sparked some debate. Despite being OpenAI’s largest and most expensive model to date, GPT-4.5 didn’t exactly wow everyone. Compared to recent leaps in LLM reasoning capability from OpenAI’s own o3 model and Grok-3, GPT-4.5 feels incremental. It certainly has a touch of something extra, particularly improved creative writing, humor, and emotional intelligence, but its strengths are hard to pin down, and on many tasks it is a mixed bag relative to the heavily optimized GPT-4o and the thinking-token-powered o1-pro. In contrast to the recent leaps we have seen from reinforcement learning with verifiable rewards, GPT-4.5 sits firmly within expected pre-training compute scaling laws rather than leapfrogging them.
In other news, Deepseek has been busy this week open-sourcing a very impressive and valuable set of internal tools and code optimizations for training and serving LLMs. More on this below! Deepseek also reported an impressive 5,432 billion LLM tokens processed per week across its chatbot and API for both the v3 and r1 models (4,256bn input tokens, of which 2,394bn were served with a cache hit, plus 1,176bn output tokens). Thanks to many of the inference optimizations it shared, Deepseek managed to achieve this on a cluster of just 2,200 H800 GPUs. It added that if all these tokens had been charged at its r1 pricing, it would have made an 84.5% margin. This is not a realistic scenario in practice, as v3 usage is less compute-heavy than r1 while its app is free, but it still points to strongly positive margins. It suggests Deepseek’s incredibly vertically integrated and customized inference stack delivers greater inference efficiency than the closed US AI labs achieve.
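As a back-of-the-envelope check on these figures, the sketch below computes the implied weekly revenue and cost from the reported token counts. The per-million-token prices are an assumption based on DeepSeek’s published r1 API rates at the time ($0.14/M for cached input, $0.55/M for uncached input, $2.19/M for output); verify current pricing before reusing this.

```python
# Rough, illustrative arithmetic only: token counts are DeepSeek's reported
# weekly figures; per-million-token prices are assumed from DeepSeek's
# published r1 API rates at the time.
PRICE_PER_M = {"input_cache_hit": 0.14, "input_cache_miss": 0.55, "output": 2.19}

tokens_bn = {
    "input_cache_hit": 2_394,           # billions of tokens, from the report
    "input_cache_miss": 4_256 - 2_394,  # total input minus cache hits
    "output": 1_176,
}

# billions of tokens -> millions of tokens -> dollars
revenue = sum(tokens_bn[k] * 1_000 * PRICE_PER_M[k] for k in tokens_bn)
print(f"Implied weekly revenue at r1 list prices: ${revenue:,.0f}")

# At the reported 84.5% margin, the implied weekly serving cost would be:
cost = revenue * (1 - 0.845)
print(f"Implied weekly serving cost: ${cost:,.0f}")
```

The implied revenue lands near $3.9M per week at list prices, which is consistent with the margin claim being theoretical rather than realized, since much of the real traffic is free app usage or cheaper v3 calls.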
Back to GPT-4.5: it notably lacks the RL reasoning techniques introduced in models like o1 and o3. As a result, while GPT-4.5 likely boasts the best general writing ability and humor of any existing model, it trails in math and STEM benchmarks, falling short of o3’s reasoning prowess and Grok-3’s boosted Thinking mode. This makes GPT-4.5 the king of prose but leaves the door open for competition on many tasks. GPT-4.5 is also a mixed performer even against other non-reasoning models, though it did briefly rise to the top of LmArena (only to be quickly overtaken by an updated Grok-3). On GPQA, GPT-4.5 (71.4%) trails Grok 3 Beta (75.4%), though it surpasses Claude 3.7 (68.0%) and DeepSeek v3 (59.1%). In challenging math-oriented tests like AIME ’24, GPT-4.5 (36.7%) is again behind Grok 3 (52.2%) and even DeepSeek-V3 (39.2%), though still well ahead of Claude 3.7 (23.3%). On practical coding tasks in SWE-Bench Verified, GPT-4.5 (38.0%) again trails Claude 3.7 (62.3%) and DeepSeek v3 (42.0%).
In practical terms, GPT-4.5 rolls out initially only to ChatGPT Pro subscribers ($200/month) and API developers at eye-watering prices — $75 per million input tokens and $150 per million output tokens — 15–30x GPT-4o’s cost. It’s a costly release, reflecting the massive computational infrastructure powering this model. Notably, despite its size and scale, OpenAI hasn’t classified GPT-4.5 as a frontier model, meaning it doesn’t possess capabilities judged high-risk or transformative — adding another wrinkle to interpreting this release.
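The “15–30x” multiple falls directly out of the list prices. A quick check, assuming GPT-4o’s rates at the time were $2.50/M input and $10.00/M output (verify against OpenAI’s current pricing page):

```python
# Hypothetical price comparison: GPT-4.5 rates are from the text above;
# GPT-4o rates ($2.50/$10.00 per million tokens) are assumed list prices.
gpt45 = {"input": 75.0, "output": 150.0}  # $ per million tokens
gpt4o = {"input": 2.50, "output": 10.0}

for kind in ("input", "output"):
    ratio = gpt45[kind] / gpt4o[kind]
    print(f"{kind}: GPT-4.5 costs {ratio:g}x GPT-4o")
```

Input tokens carry the 30x multiple and output tokens the 15x multiple, which is why heavily prompted (context-stuffed) workloads feel the price increase most.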
Just how big is GPT-4.5? OpenAI is not open enough to disclose this, but GPT-4.5 API prices are 2.5x higher even than the original GPT-4 release, despite huge gains in GPU and inference efficiency techniques over the past two years, suggesting an even greater increase in active parameters. (GPT-4 is widely thought to be a 1.8tn-parameter Mixture of Experts with ~280bn active parameters.) Training compute is suggested to be in the ballpark of 10x that of GPT-4, but some of this increase was spent on scaling training data rather than just increasing model size. New training efficiency techniques were also used, which makes estimates difficult. Nevertheless, it is fair to say from pricing and latency that this is the largest model ever made generally available.
Why did OpenAI opt to release GPT-4.5 without pushing hard into reasoning techniques like RLVR? Likely due to the computational cost of doing so for such a large model, plus strategic simplification ahead of the GPT-5 “unified intelligence” model expected later this year. GPT-4.5 marks the end of OpenAI’s non-chain-of-thought models, implying that the company’s true leap forward, the integration of large-scale pre-training with reinforcement learning, is still to come. The emphasis for 4.5 now shifts to improving inference efficiency, where innovations like Deepseek’s openly shared inference optimizations may prove beneficial!
Why should you care?
While GPT-4.5 might not find many practical uses at its current price point, it sets a foundation upon which the next generation can build. Given the increasingly fragmented state of current SotA LLM scores across benchmarks and tasks, the path forward seems increasingly clear: a truly breakthrough next-generation LLM built using only already-proven wins could integrate GPT-4.5’s parameter scale, Grok-3’s training-cluster muscle, o3’s reinforcement-learning-backed reasoning (RL for o3 was scaled much further than for o1, Grok-3, or Sonnet 3.7’s thinking mode), Deepseek’s and Sonnet 3.7’s data curation and architectural finesse, Gemini’s huge context length, and Deepseek’s unparalleled inference efficiency. In other words, each contender has carved out its own expertise; none has yet assembled all the components.
On top of foundation model improvements, there is still a huge amount of work for LLM developers to do: building practical implementations and customizations of these models into advanced LLM pipelines and agents! If you don’t yet know how to code but want to join this AI effort, it has never been quicker to learn, and we can help with our new “AI native” Python course for complete software novices.
— Louie Peters — Towards AI Co-founder and CEO
Our ‘Python Primer for Generative AI’ Course Is Here!
Learn Python the LLM-Native Way
Most Python courses teach you syntax and theory before you build anything useful. But we think anyone can start building something from day one — and use LLMs to make the learning process faster and more intuitive.
We have designed Python Primer for Generative AI, our newest course, with exactly that in mind. This is the first Python course designed specifically for LLM development that helps you focus on what you want to build, not on every arcane detail of programming. It teaches you how to think, build, and solve problems like an AI engineer — right from day one.
This course follows a bottom-up, build-first approach, meaning you’ll be writing Python from the very first lesson, gaining hands-on experience with industry-standard tools like Hugging Face.
- Learn Python by building real AI applications — Every concept is tied to a practical, real-world use case.
- Use LLMs as your personal coding assistants — Learn to use LLMs to generate code, iterate on your project and ask LLMs the right questions to explain the code and accelerate your learning process.
- Think like an AI engineer — Develop critical problem-solving skills that extend beyond just coding.
We believe the future of AI belongs to LLM developers. If you’re serious about creating with LLMs, it’s time to start mastering Python specifically for LLMs.
Build your foundation today — Join the Course.
If you are already familiar with Python, you can start your journey to becoming a certified LLM Developer right away with our From Beginner to Advanced LLM Developer course. Or, if you’re ready to fully commit to mastery, take advantage of our exclusive bundle offer and save over $125.
Hottest News
1. OpenAI Unveils GPT-4.5 ‘Orion,’ Its Largest AI Model Yet
OpenAI unveiled GPT-4.5, its latest LLM, featuring reduced hallucinations and enhanced conversational abilities. Initially available to ChatGPT Pro users, GPT-4.5 emphasizes unsupervised learning for pattern recognition and creativity. Despite improvements, the model’s high API cost and varying performance compared to rivals raise questions about its value in competitive AI markets.
2. DeepSeek Open Source Week: A Complete Summary
During Open Source Week, Deepseek released five cutting-edge repositories: FlashMLA (an MLA decoding kernel for Hopper GPUs), DeepEP (a communication library for MoE models), DeepGEMM (an optimized general matrix multiplication library), Fire-Flyer File System (a distributed file system for ML workflows), and the DeepSeek-V3/R1 Inference System (a large-scale inference system using cross-node expert parallelism).
3. Microsoft AI Releases Phi-4-Multimodal and Phi-4-Mini
Microsoft expanded its Phi language model family with Phi-4-mini and Phi-4-multimodal, optimized for efficiency and multimodal processing. Phi-4-mini, a 3.8-billion-parameter decoder-only transformer with GQA, excels across language tasks despite its small size. Phi-4-multimodal, with 5.6 billion parameters and a Mixture of LoRAs, surpasses competitors on visual and audio benchmarks. Both are available on Hugging Face under an MIT license.
4. Apple’s Artificial Intelligence Efforts Reach a Make-or-Break Point
Apple is struggling to rebuild Siri for the age of generative AI and the company might not release “a true modernized, conversational version of Siri” until iOS 20 comes out in 2027. That doesn’t mean there won’t be big Siri updates before then. A new version of Siri will reportedly debut in May — finally incorporating all the Apple Intelligence features that the company announced nearly a year earlier.
5. Amazon Introduced a New Alexa Powered by Anthropic and Nova Models
Amazon revealed Alexa+, an enhanced AI-powered assistant, during a New York event. This upgraded Alexa integrates into smart homes, comprehends user preferences, and assists with various tasks like scheduling and security monitoring. Alexa+ harnesses generative AI for contextual and personalized responses, supporting productivity by managing documents and emails efficiently, with a release slated for later this year.
6. OpenAI Plans To Bring Sora’s Video Generator to ChatGPT
OpenAI intends to eventually integrate its AI video generation tool, Sora, directly into its popular consumer chatbot app, ChatGPT, company leaders said during an office hours session on Discord. The version of Sora that ultimately comes to ChatGPT may not offer the same level of control compared to Sora’s web app.
7. ElevenLabs Is Launching Its Own Speech-to-Text Model
ElevenLabs is releasing a stand-alone speech-to-text model. The Scribe model supports over 99 languages and features character-level timestamps, speaker diarization, and audio-event tagging. However, Scribe currently only works with pre-recorded audio. The company said it will soon release a low-latency real-time version of the model.
Five 5-minute reads/videos to keep you learning
1. Decoding OpenAI’s Advanced Reasoning Models: A Gentle Introduction to How They Work
This article introduces OpenAI’s advanced reasoning models, such as o1 and o3, which are designed to improve logical thinking and problem-solving in AI systems. Unlike traditional language models that predict words sequentially, these models generate intermediate reasoning steps before reaching a final answer. This structured approach enhances their performance in complex domains like mathematics, coding, and scientific reasoning, making them more effective at tackling intricate problems.
2. Forecasting Rare Language Model Behaviors
Anthropic’s research on forecasting rare language model behaviors explores how to predict infrequent but potentially harmful AI responses that may emerge at scale. By analyzing patterns in model outputs, they discovered that harmful behaviors follow a power law distribution. This allows researchers to estimate their likelihood even if they are not directly observed in small-scale testing. This method improves safety by enabling proactive risk mitigation before large-scale deployment.
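The core idea, that power-law tails let you extrapolate the probability of events too rare to observe directly, can be shown with a toy sketch. This is an illustrative simulation, not Anthropic’s actual estimator: we draw synthetic “elicitation scores” with a known power-law tail, fit the tail exponent on the sample, and extrapolate beyond anything observed.

```python
import math
import random

# Toy illustration: rare-event extrapolation from a fitted power-law tail.
random.seed(0)
alpha_true, x_min, n = 2.5, 1.0, 10_000

# Simulated "elicitation scores" with a Pareto (power-law) tail,
# drawn via inverse-transform sampling.
scores = [x_min * (1 - random.random()) ** (-1 / alpha_true) for _ in range(n)]

# Maximum-likelihood (Hill-style) estimate of the tail exponent.
alpha_hat = n / sum(math.log(x / x_min) for x in scores)

# Extrapolate: probability of a score beyond anything seen in this sample.
threshold = 2 * max(scores)
p_exceed = (threshold / x_min) ** (-alpha_hat)
print(f"alpha_hat={alpha_hat:.2f}, P(score > {threshold:.1f}) ~ {p_exceed:.2e}")
```

The fitted exponent recovers the true tail behavior from a modest sample, which is what makes small-scale red-teaming informative about deployment-scale risk.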
3. Introduction to CUDA Programming for Python Developers
GPUs, with their thousands of cores, excel in parallel processing, ideal for tasks like deep learning. NVIDIA’s CUDA allows developers to write programs that run directly on the GPU, enhancing performance by managing parallel workloads. While frameworks like PyTorch abstract GPU complexities, understanding CUDA can further optimize performance, especially through custom fused kernels for advanced workloads.
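The heart of the CUDA programming model is that each thread computes its own global index from its block and thread coordinates. The sketch below simulates that indexing scheme serially in pure Python (no GPU required); it is a teaching toy, not real CUDA code, but the index arithmetic and bounds guard mirror what a real `c[i] = a[i] + b[i]` kernel does.

```python
# Pure-Python simulation of CUDA-style thread indexing (no GPU needed).
def vector_add_kernel(a, b, c, block_idx, thread_idx, block_dim):
    i = block_idx * block_dim + thread_idx  # global index, as in CUDA
    if i < len(c):                          # guard: grid may overshoot the data
        c[i] = a[i] + b[i]

def launch(kernel, grid_dim, block_dim, *args):
    # A real GPU runs these iterations in parallel; we loop serially.
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(*args, block_idx, thread_idx, block_dim)

n = 10
a, b, c = list(range(n)), [10] * n, [0] * n
grid_dim = (n + 3) // 4  # ceil-divide the work into blocks of 4 "threads"
launch(vector_add_kernel, grid_dim, 4, a, b, c)
print(c)
```

The ceil-division for `grid_dim` and the `i < len(c)` guard are the standard CUDA idioms for handling data sizes that are not a multiple of the block size.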
4. NN#7 — Neural Networks Decoded: Concepts Over Code
This article reviews the metrics used to evaluate neural network models, which vary with the type of task the network performs, and gives an overview of accuracy, precision, F1-score, and more.
5. OmniAI OCR Benchmark
OmniAI’s benchmark evaluated the OCR accuracy of traditional models and Vision Language Models (VLMs) on complex real-world documents. The benchmark used a novel open-source methodology, comparing OCR JSON outputs to ground-truth JSON while factoring in cost and latency for a comprehensive provider assessment. This article highlights the details.
Repositories & Tools
- Merlion is a Python library for time series intelligence that provides an end-to-end ML framework.
- DiffSynth Studio is a GitHub platform and codebase that advances diffusion model applications.
- FastRTC simplifies building real-time audio and video AI applications in Python with features like Automatic Voice Detection, Turn Taking, and WebRTC-enabled Gradio UI.
- HELM is a framework to increase the transparency of language models.
- olmOCR is a toolkit for training LMs to work with PDF documents.
- Magma is a foundation model for multimodal AI agents, designed to handle interactions across virtual and real environments.
Top Papers of The Week
1. CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models
Researchers from Alibaba and other AI labs introduce CodeCriticBench, a benchmark for evaluating the code critique capabilities of Large Language Models (LLMs). It includes code generation and QA tasks with both basic and advanced critique evaluations.
2. Self-Rewarding Correction for Mathematical Reasoning
Researchers have developed self-rewarding reasoning language models that independently generate and evaluate their reasoning. Using a two-stage framework, they employ rejection sampling and reinforcement learning to enable self-correction and accuracy assessment. Experiments with models like Llama-3 and Qwen-2.5 show improved self-correction performance, comparable to models using external feedback.
3. Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs
Researchers demonstrate an effect they call emergent misalignment in LLMs: a model fine-tuned to output insecure code, without disclosing this to the user, ends up acting misaligned on a broad range of prompts unrelated to coding.
4. Agentic Reward Modeling: Integrating Human Preferences With Verifiable Correctness Signals
This paper implements a reward agent named RewardAgent. It combines human preference rewards with two verifiable signals, factuality and instruction following, to provide more reliable rewards. RewardAgent significantly outperforms vanilla reward models, demonstrating its effectiveness.
5. Auditing Prompt Caching in Language Model APIs
This paper conducts statistical audits to detect prompt caching in real-world LLM API providers. It shows global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users’ prompts. The audit also highlights timing variations due to prompt caching, which can also result in leakage of information about the model architecture.
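The audit's basic statistical machinery can be sketched in a few lines. This toy version uses simulated latencies rather than real API calls: if repeated prompts hit a cache, their response times should be measurably lower, and a permutation test checks whether the observed gap could be chance.

```python
import random
import statistics

# Toy timing audit with simulated latencies (not real API measurements).
random.seed(1)
fresh  = [random.gauss(1.00, 0.05) for _ in range(50)]  # uncached prompts (s)
repeat = [random.gauss(0.70, 0.05) for _ in range(50)]  # possibly cached (s)

observed = statistics.mean(fresh) - statistics.mean(repeat)

# Permutation test: how often does a random relabeling of the pooled
# latencies produce a gap at least as large as the one observed?
pooled, n, count = fresh + repeat, len(fresh), 0
for _ in range(2_000):
    random.shuffle(pooled)
    if statistics.mean(pooled[:n]) - statistics.mean(pooled[n:]) >= observed:
        count += 1
p_value = (count + 1) / 2_001

print(f"mean gap {observed:.3f}s, permutation p ~ {p_value:.4f}")
```

A tiny p-value is evidence of caching; running the same audit with prompts first issued from a *different* account is what distinguishes a per-user cache from the cross-user global caching the paper found.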
Quick Links
1. Anthropic has raised $3.5 billion at a $61.5 billion post-money valuation. According to Anthropic’s blog, the company will use this investment to advance development of next-generation AI systems, expand compute capacity, deepen research in mechanistic interpretability and alignment, and accelerate its international expansion.
2. AI cloud provider CoreWeave files for IPO. The cloud provider generated $1.92 billion in 2024 revenue, up 737% year over year. In 2024, around 77% of revenue came from two customers, with 62% of the total flowing from Microsoft.
3. OpenRouter crossed 1 trillion tokens this week. OpenRouter provides a unified API that gives you access to hundreds of AI models.
4. Sesame has unveiled its latest research on conversational voice technology, focusing on achieving “voice presence” — a quality that makes interactions with digital assistants feel authentic and emotionally resonant. The centerpiece of this initiative is the Conversational Speech Model (CSM), a new approach to speech generation that uses multimodal learning with transformers.
5. You.com unveils ARI, an AI research agent that processes 400+ sources. This new AI tool is designed to enhance the efficiency and accuracy of research across various industries by leveraging its ability to process and analyze data from over 400 sources simultaneously. This capability represents a quantum leap from the current AI tools that handle only 30–40 sources at a time.
Who’s Hiring in AI
Junior Software Engineer @L3Harris (Rochester, NY, USA)
Software Engineer (Data) @InDebted (Remote/USA)
Machine Learning Engineer @INTEL (Hillsboro, OR, USA)
Analytics Engineer @SingleStore (Pune, India)
Research Fellow — AI & I @Mayo Clinic (Rochester, NY, USA)
Senior Backend Developer — Merchant Services @GoTo Group (Jakarta, Indonesia)
Summer College Intern-IT Data Scientist @City of New York (New York, NY, USA)
Software Developer (Early Career/Young Talent Program) @Crypto.com (Hong Kong)
Interested in sharing a job opportunity here? Contact [email protected].
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
Join over 80,000 data leaders and subscribers on the AI newsletter and keep up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI