TAI #136: DeepSeek-R1 Challenges OpenAI-o1 With ~30x Cheaper Open-Source Reasoning Model
Author(s): Towards AI Editorial Team
Originally published on Towards AI.
What happened this week in AI by Louie
This week, the LLM race was blown wide open with DeepSeek's open-source release of R1. Its performance is close to o1's on most benchmarks. Built on top of DeepSeek's V3 model, R1's API output token prices are roughly 30x lower than o1's. It's available under the MIT license, supporting commercial use and modifications. DeepSeek also disclosed many of its methods and experiments in its paper, in stark contrast to the secrecy surrounding reasoning techniques at AI labs in the U.S.
R1 wasn't the only huge LLM release from China this week. Two new LLM competitors hit the ground running with very strong models. MiniMax-01, a 456B-parameter Mixture-of-Experts model, challenges Google's Gemini models for SoTA long-context capabilities, offering a 4-million-token input context thanks to its new hybrid Lightning Attention architecture. Kimi k1.5, on the other hand, is another new reasoning model that challenges o1 on multimodal capabilities.
DeepSeek's release included three different models/model families:
DeepSeek-R1-Zero was an experiment that applied reinforcement learning (RL) directly to a base language model (V3) without any prior supervised fine-tuning. In essence, they attempted to teach the model to reason purely through trial and error, providing it with rewards for correct answers and well-formatted responses. This is somewhat analogous to how AlphaZero mastered games like Go and chess, learning solely through self-play and a reward signal based on winning or losing. The results were very impressive on many benchmarks; however, the model fell short in some areas, and its output was often messy and hard to read.
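For intuition, here is a minimal sketch of the kind of rule-based reward described in the paper: one term checks that the extracted answer matches a known ground truth, and another checks that the response follows the required `<think>`/`<answer>` template. The helper names and the equal weighting are our own illustrative assumptions, not DeepSeek's actual code.

```python
import re

def format_reward(response: str) -> float:
    # Reward responses that wrap their reasoning in <think>...</think>
    # and their final result in <answer>...</answer>, as the R1-Zero
    # training template requires.
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.search(pattern, response, re.DOTALL) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    # For verifiable tasks (math with a final answer, code with unit
    # tests), the reward is simply whether the extracted answer matches.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Illustrative equal weighting; the paper does not publish exact weights.
    return accuracy_reward(response, ground_truth) + format_reward(response)
```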
To address the limitations of R1-Zero and enhance its reasoning abilities further, the DeepSeek team introduced R1, which incorporated a "cold start" of human-like reasoning data before applying reinforcement learning. This involved creating a small dataset of examples demonstrating desired reasoning patterns and output formats. This was followed by a multi-stage process. First, reasoning-oriented RL was applied, focusing on tasks with clear solutions, like math and coding. Then, they generated a new batch of high-quality data samples for fine-tuning, created by filtering model outputs during the RL phase. Finally, they applied a final round of reinforcement learning, this time focusing on general helpfulness and harmlessness in addition to reasoning.
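Condensed into pseudocode, the recipe looks roughly like the sketch below. The stage functions (`sft`, `grpo_rl`, `rejection_sample`) and reward callables are placeholders for the stages described in the paper, not a runnable training API.

```python
def train_r1(v3_base, cold_start_data, reasoning_tasks, general_sft_data, general_prefs):
    """Pseudocode outline of the multi-stage R1 recipe described above."""
    # Stage 1: "cold start" supervised fine-tuning on a small set of
    # curated, human-readable long chain-of-thought examples.
    model = sft(v3_base, cold_start_data)

    # Stage 2: reasoning-oriented RL on tasks with verifiable answers
    # (math, coding), driven by rule-based accuracy/format rewards.
    model = grpo_rl(model, reasoning_tasks, reward=rule_based_reward)

    # Stage 3: filter/rejection-sample the RL checkpoint's outputs to build
    # a larger, higher-quality SFT dataset, then fine-tune on it again
    # (the paper restarts this stage from the V3 base model).
    sft_data = rejection_sample(model, reasoning_tasks) + general_sft_data
    model = sft(v3_base, sft_data)

    # Stage 4: a final RL round that also optimizes for general
    # helpfulness and harmlessness, not just reasoning accuracy.
    return grpo_rl(model, reasoning_tasks + general_prefs, reward=combined_reward)
```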
Across key benchmarks like AIME 2024, Codeforces, GPQA Diamond, and MATH-500, DeepSeek-R1 consistently performs on par with OpenAI's o1 (79.8 vs. 79.2, 96.3 vs. 96.6, 71.5 vs. 75.7, and 97.3 vs. 96.4, respectively). It also achieves very similar performance on the SWE-bench Verified coding benchmark (49.2 vs. 48.9).
The final piece of DeepSeek's work involved distilling the advanced reasoning capabilities of R1 into smaller, cheaper, dense models (the Llama and Qwen series). Using the larger R1 model as a "teacher," they fine-tuned several smaller models (ranging from 1.5B to 70B parameters) on the high-quality data curated from the R1 training process. The smaller distilled models significantly outperformed other models of similar sizes and even rivaled much larger models on reasoning benchmarks. DeepSeek-R1 outputs distilled into the tiny Qwen 1.5B model even beat GPT-4o on some math and code benchmarks!
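As a hedged illustration of what this style of distillation looks like in practice, the sketch below fine-tunes a small open Qwen model on (prompt, R1-generated reasoning trace) pairs using Hugging Face transformers. The model ID, data file, and hyperparameters are illustrative assumptions on our part; DeepSeek's exact data mix and settings are described in their paper.

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

student_name = "Qwen/Qwen2.5-1.5B"  # small dense "student" (illustrative choice)
tokenizer = AutoTokenizer.from_pretrained(student_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(student_name)

# Hypothetical JSONL file of {"prompt": ..., "response": ...} pairs, where
# "response" is a long chain-of-thought answer sampled from the R1 teacher.
dataset = load_dataset("json", data_files="r1_traces.jsonl", split="train")

def tokenize(example):
    # Simplified: train on prompt + response jointly (no prompt masking).
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=4096)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-distilled-qwen-1.5b",
                           per_device_train_batch_size=1,
                           num_train_epochs=2, bf16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```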
Why should you care?
DeepSeek-R1's release is significant for several reasons. First, its open-source nature and competitive performance at a fraction of the cost of o1 democratize access to advanced reasoning capabilities. The API costs of DeepSeek-R1 per million tokens are currently $0.14 for cached inputs, $0.55 for non-cached inputs, and $2.19 for outputs. In contrast, the API costs for o1 are $7.50, $15, and $60, respectively: roughly a 30x difference in cost! Moreover, the openly released model weights create huge opportunities for adapting and fine-tuning these models for different domains and industries. The open release of its training methods also provides a blueprint for many others to follow. One surprise from the paper was that simpler techniques for enabling reasoning abilities worked better than some more complex options. We think there is a huge area for exploring and experimenting with these techniques now that scaled reinforcement learning for LLMs has been unlocked!
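The "~30x" figure is simply the ratio of the published per-million-token prices; a quick back-of-the-envelope check:

```python
# Per-million-token API prices quoted above (USD).
r1 = {"cached input": 0.14, "input": 0.55, "output": 2.19}
o1 = {"cached input": 7.50, "input": 15.00, "output": 60.00}

for kind in r1:
    print(f"{kind}: o1 is {o1[kind] / r1[kind]:.1f}x the price of R1")
# cached input: 53.6x, input: 27.3x, output: 27.4x
```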
The success of distilling large reasoning models into much smaller non-reasoning models also suggests we will get another wave of rapid improvement and cost reduction across the LLM spectrum.
The fact that a Chinese company is leading this charge also adds a geopolitical dimension, particularly given that DeepSeek has managed to achieve this despite GPU export restrictions and a far smaller budget than Western AI labs.
- Louie Peters, Towards AI Co-founder and CEO
Introducing Our Brand New 8-hour Generative AI Primer Course
A programming language-agnostic 1-day LLM Bootcamp designed for developers.
95% of developers I meet are only scratching the surface of what LLMs can do. When working with LLMs, you are CONSTANTLY making decisions such as open-source vs. closed-source, how to fit LLMs into your use case, whether no-code solutions are good enough for your workflow, how much weight to give the limitations of LLMs, and so on. And the biggest gap we see on top of all this is whether you are using LLMs to their full capacity, even with chat interfaces like ChatGPT or APIs for models like Gemini. The question is: are you?
This certification course is specifically designed to cut through the noise, help you ask the right questions, and show you exactly how to find answers. LLMs are moving so fast, with updates being released almost every day; what you need is an intuitive "framework," and just like LLMs, you need enough "context" to know what developments are relevant to you and your use case so you can make the most out of this transformative technology.
In just 8 hours, through lessons, videos, exercises, quizzes, and hands-on projects, you'll:
- Dive deep into the "psyche" of LLMs: how they work, how to make them work better, and how to train them for tasks you hate doing.
- Work with leading AI models and integrate them into your workflows seamlessly.
- Build your own no-code/low-code prototype that brings your ideas to life.
You'll finish before you even realize it, and by tomorrow, you'll already be AI-proofed. Secure your spot now!
Hottest News
1. OpenAI Released Scheduled Tasks in ChatGPT
OpenAI has introduced scheduled tasks in ChatGPT for Plus, Pro, and Team plans. These allow automated prompts and notifications on the Web, iOS, Android, and MacOS. Users can assign tasks like daily updates or reminders and receive notifications via push or email. Windows support will follow in Q1. Currently, a limit of 10 active tasks is enforced.
2. Chinese AI Company MiniMax Releases New Models
Chinese AI company MiniMax, an Alibaba- and Tencent-backed startup, debuted three new models. MiniMax-Text-01 is a text-only model, while MiniMax-VL-01 can understand images and text. T2A-01-HD, meanwhile, generates audio, specifically speech. MiniMax claims that MiniMax-Text-01 performs better than models such as Gemini 2.0 Flash and that MiniMax-VL-01 rivals Claude 3.5 Sonnet.
3. Kimi Launches New SOTA Multimodal Model
Moonshot AI, the Beijing-based company behind Kimi, introduced the new Kimi k1.5 multimodal reasoning model. Updates include long-context extension, improved policy optimization, and multimodality. Its report shows that its SOTA short-CoT performance outperforms GPT-4o and Claude 3.5 Sonnet on AIME, MATH-500, and LiveCodeBench by a large margin.
4. Alibaba Slashes Prices on LLMs by Up to 85% As Chinaβs AI Rivalry Heats Up
Alibaba Cloud announced an 85% price reduction on its Qwen-VL visual language model. The move demonstrates how competition among Chinaβs technology giants to win more business for their nascent artificial intelligence products is intensifying.
5. Google Is Forming a New Team To Build AI That Can Simulate the Physical World
Google is forming a new team led by Tim Brooks under DeepMind to build AI models for simulating the physical world, collaborating with the Gemini, Veo, and Genie teams on "world models." These models will support video generation, multimodal data, and interactive environments.
6. Mistral Signs Deal With AFP To Offer Up-to-Date Answers in Le Chat
Mistral has announced a content deal with newswire Agence France-Presse (AFP) to improve the accuracy of answers in Le Chat, Mistral's chatbot. Le Chat will be able to tap into AFP's stories (around 2,300 per day in six languages) and query AFP's entire archive dating back to 1983.
7. President Trump Repeals Bidenβs AI Executive Order
President Donald Trump revoked a 2023 executive order signed by former President Joe Biden that sought to reduce the potential risks AI poses to consumers, workers, and national security. During his campaign, Trump promised policies to "support AI development rooted in free speech and human flourishing."
Five 5-minute reads/videos to keep you learning
1. Retrieval-augmented generation (RAG) and cache-augmented generation (CAG) are two methodologies for generating more context-aware responses from LLMs. This article provides an extensive, step-by-step guide to both approaches, dives into their workflows, compares their advantages and drawbacks, and offers an implementation guide for CAG.
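To make the contrast concrete, here is a minimal sketch of the CAG idea with Hugging Face transformers (recent versions): encode the whole knowledge base once, keep the resulting KV cache, and reuse it for every query instead of retrieving chunks per request. The model choice and prompt format are illustrative assumptions, and production use needs care around cache copying and context limits.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-1.5B-Instruct"   # any long-context chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# Pre-compute the KV cache for the whole knowledge base exactly once.
knowledge = "...all the documents you want the model to answer from..."
prefix_ids = tokenizer(knowledge, return_tensors="pt").input_ids
with torch.no_grad():
    prefix_cache = model(prefix_ids, use_cache=True).past_key_values

def answer(question: str) -> str:
    # Reuse a copy of the cached prefix so only the new question tokens
    # need to be processed; generate() mutates the cache it is given.
    cache = copy.deepcopy(prefix_cache)
    q_ids = tokenizer("\nQuestion: " + question + "\nAnswer:",
                      return_tensors="pt").input_ids
    ids = torch.cat([prefix_ids, q_ids], dim=-1)
    out = model.generate(ids, past_key_values=cache, max_new_tokens=200)
    return tokenizer.decode(out[0, ids.shape[-1]:], skip_special_tokens=True)
```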
2. Why AI Language Models Choke On Too Much Text
GPUs revolutionized AI by enabling massive parallel processing, leading to transformer models scaling rapidly. Despite advancements, transformers remain inefficient with long contexts due to quadratic compute costs. This article discusses why this happens and shares some approaches to solving this problem.
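A quick back-of-the-envelope illustration of that quadratic cost: each attention layer scores every token against every other token, so the score matrix grows with the square of the context length.

```python
# Self-attention builds an n x n score matrix per head per layer.
def pairwise_scores(n_tokens: int) -> int:
    return n_tokens ** 2

for n in (8_000, 32_000, 128_000):
    print(f"{n:>7} tokens -> {pairwise_scores(n):.2e} pairwise scores")
# Going from 8k to 128k tokens is 16x the length but 256x the attention work.
```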
3. Simplifying Alignment: From RLHF To Direct Preference Optimization (DPO)
This article explores how Direct Preference Optimization (DPO) simplifies aligning large language models with human preferences compared to Reinforcement Learning from Human Feedback (RLHF). It breaks down the math and highlights why DPO might be the smarter, easier way forward.
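For reference, the core DPO objective the article builds up to fits in a few lines of PyTorch; the sequence log-probabilities below are placeholder tensors rather than outputs from a real model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin)).

    Each argument is the summed log-probability of a full response
    (chosen = preferred, rejected = dispreferred) under the policy or the
    frozen reference model; beta controls how far the policy may drift
    from the reference.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with made-up log-probabilities for a batch of two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-15.0, -11.0]),
                torch.tensor([-13.0, -10.0]), torch.tensor([-14.0, -10.5]))
```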
4. Mastering Data Scaling: The Only Guide Youβll Ever Need (Straight From My Journey)
Data scaling is a crucial preprocessing step that prepares datasets for machine learning models and helps them perform well. This article discusses why scaling is important, its main types, and how and when to apply it.
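As a small, runnable example of the two most common approaches, standardization and min-max normalization, with scikit-learn; the toy arrays are ours:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_test = np.array([[1.5, 250.0]])

std = StandardScaler().fit(X_train)      # fit on training data only,
minmax = MinMaxScaler().fit(X_train)     # then transform both splits,
                                         # to avoid test-set leakage
print(std.transform(X_test))             # zero mean, unit variance features
print(minmax.transform(X_test))          # features squeezed into [0, 1]
```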
5. Takes On βAlignment Faking in Large Language Modelsβ
Researchers revealed that Claude 3 Opus fakes alignment with training objectives to avoid behavioral modification, a phenomenon labeled "alignment faking." This author shares their take on the results.
Repositories & Tools
- The micro diffusion repository demonstrates the training of large-scale diffusion models from scratch on a minimal budget.
- LocalAI is a free, open-source alternative to OpenAI, Claude, and others.
- Maxun lets you train a robot in 2 minutes and scrape the web on auto-pilot.
- Agentless is an agent-free approach to automatically solving software development problems.
- CopilotKit provides React UI and infrastructure for AI Copilots, in-app AI agents, AI chatbots, and more.
Top Papers of The Week
1. LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs
LlamaV-o1 redefines step-by-step visual reasoning in large language models by introducing a benchmark with eight challenge categories and a metric for granular evaluation. The multimodal model, trained through multi-step curriculum learning, surpasses existing models like LLaVA-CoT by 3.8% in performance across six benchmarks and runs five times faster during inference.
2. KaLM-Embedding: Superior Training Data Brings A Stronger Embedding Model
Researchers developed KaLM-Embedding, a multilingual embedding model using high-quality, diverse training data. Techniques like persona-based synthetic data, ranking consistency filtering, and semi-homogeneous task batch sampling enhance its performance. The model excels in multilingual embedding tasks, outperforming others of similar size on the MTEB benchmark.
3. Titans: Learning to Memorize at Test Time
This paper introduces Titans, a new family of architectures based on a new neural long-term memory module. The module learns to memorize historical context and helps attention attend to the current context while utilizing long-past information. Experimental results show that Titans are more effective than Transformers and recent modern linear recurrent models.
4. Transformer²: Self-adaptive LLMs
This paper introduces Transformer², a framework that adapts LLMs to unseen tasks in real time by selectively adjusting only the singular components of their weight matrices. During inference, Transformer² employs a dispatch system to identify the task's properties, and then task-specific "expert" vectors, trained using reinforcement learning, are dynamically mixed to obtain targeted behavior for the incoming prompt. It outperforms approaches such as LoRA with fewer parameters.
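As a rough illustration of the "adjust only the singular components" idea (not the authors' code), one can decompose a weight matrix with SVD and rescale its singular values with a task-specific "expert" vector; the shapes and the toy expert vector below are our own assumptions.

```python
import torch

def svd_adapt(W: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Rescale only the singular values of W by an 'expert' vector z.

    W: (out, in) weight matrix; z: (min(out, in),) per-singular-value scales.
    This mirrors the spirit of singular-value fine-tuning, though the
    paper's exact parameterization may differ.
    """
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U @ torch.diag(S * z) @ Vh

W = torch.randn(256, 512)
z = torch.ones(256)                       # identity expert: W reconstructed
assert torch.allclose(svd_adapt(W, z), W, atol=1e-3)
z_task = 1.0 + 0.05 * torch.randn(256)    # hypothetical task-specific expert
W_adapted = svd_adapt(W, z_task)
```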
Quick Links
1. Six charts about AI revenue. OpenAI captures approximately 62.5% of consumer AI spending. xAI's revenue jumped from $5M to $100M, while OpenAI's soared from $200M to $5B. Sapphire Ventures reports 28 AI-native companies exceeding $25M in ARR, predicting substantial growth for AI-native startups in the coming year.
2. DeepSeek-R1 achieves performance comparable to OpenAI's o1 system across mathematics, coding, and general reasoning tasks, cementing its place as a leading competitor. DeepSeek has open-sourced DeepSeek-R1-Zero and DeepSeek-R1, along with six smaller distilled models.
Whoβs Hiring in AI
Applied AI Engineer, Applied Science @Mistral AI (Paris, France)
Cambridge Internship in ML Model Optimization @Microsoft Corporation (Cambridge, United Kingdom)
Machine Learning Software Engineering Undergraduate Intern @INTEL (Santa Clara, CA, USA)
Tech Consulting AI LLM Developer Manager @Accenture (Multiple Locations)
Full-Stack Developer (React + Python + Azure) @Solvd (Remote)
AI/ML Supervisor @Ford Motor Company (Dearborn, MI, USA)
GenAI/Machine Learning Technical Project Manager @Deloitte (Multiple US Locations)
Interested in sharing a job opportunity here? Contact [email protected].
Think a friend would enjoy this too? Share the newsletter and let them join the conversation.
Join over 80,000 data leaders and subscribers on the AI newsletter and keep up to date with the latest developments in AI, from research to projects and ideas. If you are building an AI startup, an AI-related product, or a service, we invite you to consider becoming a sponsor.
Published via Towards AI