The Great LLM Convergence: When Everyone’s Best Becomes Nobody’s Advantage
Last Updated on December 2, 2025 by Editorial Team
Author(s): Ali Khalilvandi
Originally published on Towards AI.

Last year I had an answer when people asked me what LLM to use. GPT-4. Done. This year? I’ve got nothing. And it’s not because the models got worse — it’s because they all got weirdly… the same.
Claude, Grok, Llama, DeepSeek. I genuinely struggle to tell them apart sometimes. That should feel like progress. It doesn’t.
It feels like watching a market commoditize in real time.
November Was Chaos
Late 2025 turned into this bizarre release sprint. Anthropic dropped Claude Opus 4.5 on November 24th, pricing it aggressively for coding work. Google came out with Gemini 3 and its reasoning improvements. OpenAI kept doing their thing — different models for different price points, specialized variants, the usual strategy.
DeepSeek kept being DeepSeek, somehow matching everyone’s performance while apparently spending pocket change on compute.
Here’s the thing though. I can’t tell them apart.

Stanford’s 2025 AI Index backs me up on this. The gap between top models on the Chatbot Arena Leaderboard has basically collapsed. When Claude Sonnet 3.5’s updated version topped the style leaderboard, the margin was so small it barely registered.
We’re talking measurement error territory.
Efficiency Got Weird
So models got cheaper. Way cheaper. And smaller.
Microsoft’s Phi-3 Mini hits benchmark scores that match Google’s PaLM from 2022. PaLM packed 540 billion parameters; Phi-3 Mini does it with 3.8 billion. Small models are catching up fast. Faster than anyone expected, honestly.

The cost thing is even stranger. Sam Altman said “the cost to use a given level of AI falls about 10x every 12 months.”
Depending on what you’re doing, inference prices have dropped anywhere from 9 to 900 times per year.
Nine hundred times.
Great if you’re a user. Terrible if you’re OpenAI or Anthropic or Google.
These companies spent hundreds of millions training models. Hundreds of millions. And now their moat is just… gone.
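To make those rates concrete, here’s a minimal sketch of what compounded annual price drops do to a per-token rate; the $30 starting price is an illustrative placeholder, not any vendor’s actual list price.

```python
# What compounded annual price drops do to a per-million-token rate.
# The $30 starting price is an illustrative placeholder, not any
# vendor's actual list price.
start_price = 30.00  # USD per million tokens (hypothetical)

for annual_drop in (10, 900):  # Altman's 10x; the AI Index's upper bound
    price = start_price
    trajectory = []
    for year in range(4):
        trajectory.append(f"year {year}: ${price:,.6f}")
        price /= annual_drop
    print(f"{annual_drop}x/year -> " + ", ".join(trajectory))

# 10x/year:  year 3 lands at $0.03 per million tokens.
# 900x/year: year 3 is effectively free.
```

Run the numbers either way and the conclusion is the same: per-token revenue on any fixed capability level is racing toward zero.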

OpenAI’s o1 from September 2024 was genuinely impressive on math and reasoning. But it’s expensive. It’s slow. Who’s paying that premium when DeepSeek-R1 gets you comparable results for a fraction of the cost?
The China Thing
This is the part that should worry the Valley. The gap between American and Chinese models has essentially disappeared.

Stanford’s numbers tell the story: American models led Chinese ones by 17.5 percentage points on MMLU in 2023. By 2024? 0.3 percentage points. Point three.
DeepSeek is eating their lunch while spending less money doing it.
If you’re Anthropic or OpenAI, your whole pitch has been that your models are fundamentally better. Worth the premium. Worth the enterprise contracts. Worth the insane valuations.
DeepSeek-R1 came out in January 2025 and matched leading Western models while reportedly using far fewer resources.
Benchmarks Are Lying to You
It wasn’t long ago that Claude confidently hallucinated an AWS CDK method that has never existed, then doubled down with two different fake names when I called it out. GPT and Gemini invented their own equally fictional alternatives in the same conversation. Three frontier models, three different flavors of pure invention.
The leaderboards still brag that these same models score 90–96% on HumanEval and the other coding evals we treat as gospel. My lived experience disagrees.
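There is one cheap guardrail: before trusting a model-suggested API, check that the symbol actually exists. A minimal sketch in Python, with `symbol_exists` as my own hypothetical helper, demonstrated on the standard library rather than the CDK:

```python
import importlib

def symbol_exists(module_name: str, attr_path: str) -> bool:
    """Return True if module_name is importable and attr_path resolves on it.

    A cheap sanity check before pasting an LLM-suggested call into real code.
    """
    try:
        obj = importlib.import_module(module_name)
    except ImportError:
        return False
    for part in attr_path.split("."):
        if not hasattr(obj, part):
            return False
        obj = getattr(obj, part)
    return True

print(symbol_exists("json", "dumps"))         # True: a real function
print(symbol_exists("json", "dumps_pretty"))  # False: a plausible hallucination
```

Thirty seconds of introspection catches the class of error that three frontier models confidently walked me into.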
This gap between benchmark performance and actual usefulness is revealing something important. The industry’s response to convergence has been to invent harder tests.
Models saturated MMLU, GSM8K, HumanEval — so researchers created “Humanity’s Last Exam” where top systems score 8.8%. There’s FrontierMath where AI solves basically nothing. BigCodeBench where they hit roughly a third of human performance.

We’re not building smarter models. We’re raising the bar to maintain the illusion that there’s still differentiation. And the benchmarks themselves are suspect — a lot of popular evals lean heavily on coding and debugging, which skews everything.
Even OpenAI’s o3, announced December 2024, hit 75.7% on ARC-AGI in its efficient configuration (previous best was around 55%) and needed a high-compute configuration burning orders of magnitude more per task to reach 87.5%. We’re not measuring intelligence. We’re measuring who’ll spend the most money gaming the test.
What Actually Matters Now
The differences that matter aren’t showing up on benchmarks.
Claude Sonnet 3.5 feels better for coding. Not because it scores 2% higher on some metric — because it handles context in a way that makes the conversation flow. GPT-4o’s speed is useful because I don’t lose my train of thought waiting for it.
These are UX advantages. Not capability advantages.
The real innovation in 2025 wasn’t capabilities. It was efficiency. Accessibility.
Training a model like Llama 3.1 405B takes roughly 90 days of compute. But smaller models are closing the gap so fast that by the time these giants finish training, they’re already being challenged by models a fraction of their size.
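That timescale roughly survives a back-of-envelope check using the common ~6·N·D training-FLOPs heuristic. In the sketch below, the token count and cluster size follow Meta’s published figures, while the utilization number is my assumption:

```python
# Back-of-envelope training time for a Llama 3.1 405B-scale run, using
# the common ~6*N*D FLOPs heuristic. Token count and cluster size follow
# Meta's published figures; the MFU (utilization) value is an assumption.
params = 405e9        # N: model parameters
tokens = 15.6e12      # D: training tokens (Meta's reported figure)
total_flops = 6 * params * tokens          # ~3.8e25 FLOPs

gpus = 16_384                              # H100s in Meta's training cluster
peak_per_gpu = 989e12                      # H100 dense BF16 peak FLOP/s
mfu = 0.40                                 # assumed model FLOPs utilization
cluster_flops = gpus * peak_per_gpu * mfu  # effective cluster throughput

days = total_flops / cluster_flops / 86_400
print(f"~{total_flops:.1e} FLOPs, ~{days:.0f} days of training")
# ~3.8e+25 FLOPs, ~68 days; drop the MFU to ~0.3 and you land near 90.
```

Three months of a sixteen-thousand-GPU cluster, and the result gets undercut by a distilled competitor within a quarter.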
Meta called Llama 3.1 405B “the world’s largest and most capable openly available foundation model” when it dropped in July 2024. Mistral’s compact models beat competitors of similar size across major benchmarks.
The Technology Innovation Institute in Abu Dhabi released Falcon models using state-space architecture — more efficient than traditional transformers.
The Efficiency Revolution
Smaller labs are achieving comparable performance with publicly available approaches. The first-mover advantage evaporated. Within a year of Meta open-sourcing Llama 2, whatever technical lead it conferred was gone.
Microsoft hired Inflection AI’s leadership, including co-founder Mustafa Suleyman, who became CEO of Microsoft AI. The bet: efficiency, not raw capability, would be the differentiator.
Open-weight models are catching up too. In early 2024, leading closed-weight models outperformed top open-weight models by significant margins on Chatbot Arena. By 2025, that gap had nearly disappeared.
Meta’s Llama, Mistral’s models, smaller players — all converging on the same performance envelope.
The Loyalty That Isn’t
I still use Claude for most coding. Sometimes GPT when I need something specific. But I’m not pretending this is based on meaningful technical superiority. It’s habit. Interface familiarity. The fact that I already paid for the subscription.
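If habit and price are the real deciders, switching costs should be near zero, and they nearly are. A minimal sketch, assuming each vendor exposes an OpenAI-compatible chat endpoint (many now do); the base URLs, model names, and environment-variable names here are illustrative:

```python
# A thin provider-agnostic wrapper: switching vendors becomes a config
# change, not a rewrite. Assumes OpenAI-compatible chat endpoints; the
# base URLs, model names, and env-var names are illustrative.
import os
from openai import OpenAI

PROVIDERS = {
    "openai":   {"base_url": "https://api.openai.com/v1", "model": "gpt-4o"},
    "deepseek": {"base_url": "https://api.deepseek.com",  "model": "deepseek-chat"},
}

def chat(prompt: str, provider: str = "openai") -> str:
    cfg = PROVIDERS[provider]
    client = OpenAI(
        base_url=cfg["base_url"],
        api_key=os.environ[f"{provider.upper()}_API_KEY"],
    )
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Next quarter's "cheapest model" is one dictionary entry away.
```

When moving your entire workload to a rival takes one line of config, the only thing keeping you is inertia.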
The paradox these companies face: they need to keep scaling to stay competitive, but scaling makes their products less differentiated.
Spending more to achieve less advantage. Training compute, dataset sizes, power requirements — all growing exponentially. Performance gains? Linear at best.
Claude Opus 4.5. Gemini 3. OpenAI’s latest whatever. Impressive engineering, all of it. But impressive the way modern smartphones are impressive. Technically sophisticated. Incrementally better than last year.
The question isn’t which model is best anymore. It’s whether “best” means anything when the entire field clusters within a few percentage points. Whether the billions being spent to close those final gaps are going anywhere except into a very expensive dead end.
I don’t have confidence in my LLM recommendations anymore. I’ll probably switch to whatever’s cheapest next quarter.
That’s not loyalty. That’s the market working.
Sources
[1] Anthropic. “Claude Opus 4.5 Release.” https://www.anthropic.com/news/claude-opus-4-5
[2] Mashable. “Gemini 3 vs ChatGPT: Here is how they compare.” https://mashable.com/article/gemini-3-vs-chat-gpt-here-is-how-they-compare
[3] Wikipedia. “DeepSeek.” https://en.wikipedia.org/wiki/DeepSeek
[4] Stanford HAI. “2025 AI Index Report.” https://hai.stanford.edu/ai-index/2025-ai-index-report
[5] Stanford HAI. “AI Index 2025: State of AI in 10 Charts.” https://hai.stanford.edu/news/ai-index-2025-state-of-ai-in-10-charts
[6] Business Insider. “Sam Altman: Cost of using AI drops 10x every 12 months.” https://www.businessinsider.com/sam-altman-cost-using-ai-drop-10
[7] Wikipedia. “OpenAI o1.” https://en.wikipedia.org/wiki/OpenAI_o1
[8] Stanford HAI. “AI Index 2025 Report — Figure 2.1.37.” https://hai.stanford.edu/ai-index/2025-ai-index-report
[9] Built In. “DeepSeek-R1: What You Need to Know.” https://builtin.com/artificial-intelligence/deepseek-r1
[10] IEEE Spectrum. “AI Index 2025.” https://spectrum.ieee.org/ai-index-2025
[11] TechCrunch. “The AI industry is obsessed with Chatbot Arena, but it might not be the best benchmark.” https://techcrunch.com/2024/09/05/the-ai-industry-is-obsessed-with-chatbot-arena-but-it-might-not-be-the-best-benchmark/
[12] ARC Prize. “OpenAI o3 Breakthrough.” https://arcprize.org/blog/oai-o3-pub-breakthrough
[13] Meta AI. “Introducing Llama 3.1.” https://ai.meta.com/blog/meta-llama-3-1/
[14] Wikipedia. “Mistral AI.” https://en.wikipedia.org/wiki/Mistral_AI
[15] AWS Machine Learning Blog. “TII Falcon H1 Models Now Available.” https://aws.amazon.com/blogs/machine-learning/tii-falcon-h1-models-now-available-on-amazon-bedrock-marketplace-and-amazon-sagemaker-jumpstart/
[16] Wikipedia. “Inflection AI.” https://en.wikipedia.org/wiki/Inflection_AI
[17] Stanford HAI. “Technical Performance Analysis.” https://hai.stanford.edu/technical-performance