DeepSeek in My Engineer’s Eyes
Author(s): Kelvin Lu
Originally published on Towards AI.
It’s been almost a year since my last post — June of last year, to be exact. The reason? I simply didn’t come across anything I felt was exciting enough to share. Don’t get me wrong; this isn’t to say there hasn’t been progress in AI, or that my past six months have been unproductive. On the contrary, there have been significant advancements in the field, and my own work has been quite fruitful.
That said, I’ve noticed a growing disconnect between cutting-edge AI development and the realities facing AI application developers. Take, for example, the $500 billion Stargate project announced in the U.S. While it’s an ambitious endeavour, does it really matter to most of us what specific technologies will be used? If this is the direction AI is heading, the forefront of innovation will increasingly become the domain of just two players: the U.S. and China. For the rest of the world, it doesn’t matter whether you are an interested individual, a company, or a country; you simply won’t have the opportunity to compete.
Then there are the application-level technologies, such as RAG and AI agents. RAG, while useful, is ultimately a design pattern, not a complete, out-of-the-box solution, and because it lacks reasoning capability it remains a fairly dumb one. AI agents, on the other hand, hold a lot of promise but are still constrained by the reliability of LLM reasoning. From an engineering perspective, the core challenge for both lies in improving accuracy and reliability enough to meet real-world business requirements. Building a demo is one thing; scaling it to production is an entirely different beast.
Everything changed when Deepseek burst onto the scene a month ago. My experience felt like driving down a long, boring stretch of road at night. My eyes were half-closed, lulled by the dull hum of the engine. Then, out of nowhere, a roaring race car sped past me, kicking up a cloud of dust as it vanished into the distance in mere seconds. I sat there, wide-eyed, jaw dropped, staring at the haze left in its wake. That moment was a month ago, yet the shockwaves of that encounter still echo in my mind.
Deepseek has disrupted the world in countless ways. Some have labeled it a national security threat, a copycat, a tailgater, a data thief, a distiller, and so on. I dismiss these claims entirely. In the boxing ring, emotions can cloud judgment; if you turn emotional, you have already lost. When Tyson bit Holyfield’s ear in front of billions of TV viewers, it was a moment of weakness, not strength.
In this post, I want to shift the conversation to how Deepseek is redefining the future of machine learning engineering. It has already inspired me to set new goals for 2025, and I hope it can do the same for other ML engineers. Let’s explore what this means for our field and how we can rise to the challenge.
AI Growth Pattern Redefined
For a long time, it has been widely assumed that AI development is governed strictly by the scaling law: the idea that model performance improves only with exponentially larger datasets and greater computational resources. This belief has not only created barriers for application developers but also raised serious questions about the sustainability of AI progress. When it is deemed necessary to pour $500 billion into the next generation of AI, one has to wonder: what is the roadmap for generating a positive return on such an investment? And what will Stargate version 2 cost? $5 trillion? That is roughly the annual revenue of the U.S. federal government! Ironically, Stargate’s roadmap towards AGI is brute force, which is not intelligent at all.
Consider that OpenAI, the leading player in the field, is still far from breaking even. The skyrocketing costs of training large language models are beginning to resemble a Ponzi scheme, in which the promise of future returns is the sole justification for ever-increasing expenditure. This raises concerns about the long-term viability of such an approach and whether the AI industry is heading toward a financial reckoning.
Related reading: “Really? AI Revolution is Losing Steam?” (pub.towardsai.net), on the sustainability of current AI development and predictions about its future.
Deepseek’s practice indicates that once computing power reaches a certain scale, further increases have a diminishing effect on model performance. With more than a dozen optimisations and novel algorithms, it was able to achieve the same or even better performance at a fraction of the cost and resources of other leading LLMs. Some analysts call this a “turning point of computation starvation”.
The most important encouragement I took from Deepseek is that formidable training-data scale is not an insurmountable barrier, and expensive hardware is not a hard limit. With the right skills, determination, and a bold heart, we can conquer them all.
ML Engineering Redefined
Unlike most LLM technical reports, which experiment with only a small number of new algorithms, Deepseek generously presented a long list of new developments:
- A 128K-1M token long context window
- MLA (Multi-head Latent Attention)
- MoE load balancing
- GRPO (Group Relative Policy Optimization)
- HAI, their self-built, highly efficient training platform
- Mixed-precision training
- Multi-token prediction
- Decoupled Rotary Position Embedding
- Large-scale reinforcement learning applied directly to LLM training, without a preceding supervised fine-tuning stage (R1-Zero)
- Hand-written PTX, the assembly-like language for GPU programming, used in model training
Does this look like Deepseek simply copied from other leading companies? To me, it looks more as if they travelled back from ten years in the future.
It is fascinating what Deepseek has achieved with its top-notch engineering skills, and it has opened up a whole set of new possibilities for ML engineers.
New Standard of Data Quality
Deepseek has made significant strides in understanding the role of training data quality in AI model development. Their research highlights that high-quality data is more impactful than sheer volume, as noisy or biased data can undermine model performance despite extensive computational resources. To address this, Deepseek employs rigorous data filtering and deduplication, ensuring only relevant and accurate data is used. They also focus on bias mitigation, using techniques like data augmentation, synthetic data generation, and balanced sampling to create diverse, representative datasets.
Deepseek advocates for a data-centric approach, prioritising data quality over model architecture improvements. They have developed tools for automated data cleaning, label validation, and error analysis, enabling efficient identification and correction of data issues. Their experiments show that curated datasets lead to more robust and reliable models, even with smaller data sizes, challenging the traditional emphasis on scaling data volume.
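To make the data-centric idea concrete, here is a minimal sketch of a deduplication-plus-filtering pass of the kind described above. It is not Deepseek’s actual pipeline; the normalisation, hashing, and quality thresholds are illustrative assumptions only.

```python
# A minimal data-cleaning sketch: exact deduplication plus a crude quality filter.
# Thresholds and heuristics are illustrative assumptions, not Deepseek's pipeline.
import hashlib
import re

def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def looks_clean(text: str, min_words: int = 20, max_symbol_ratio: float = 0.3) -> bool:
    """Drop very short or symbol-heavy documents (a stand-in for real quality scoring)."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(1 for c in text if not (c.isalnum() or c.isspace()))
    return symbols / max(len(text), 1) <= max_symbol_ratio

def dedup_and_filter(docs):
    seen = set()
    for doc in docs:
        key = hashlib.sha256(normalise(doc).encode("utf-8")).hexdigest()
        if key in seen or not looks_clean(doc):
            continue
        seen.add(key)
        yield doc

corpus = ["A short noisy doc !!!", "A longer, well-formed document about training data. " * 5]
cleaned = list(dedup_and_filter(corpus))
print(len(cleaned), "documents kept")
```

A production pipeline would add near-duplicate detection (for example, MinHash), label validation, and error analysis on top of this, but the shape of the loop is the same: score, filter, deduplicate, then train on what survives.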
New Possibilities Provided by the Mixed-Precision Model
Low-precision deployment is not new. The most common approach is to take an LLM trained in full precision and deploy it in a lower-precision mode. The drawback is that the low-precision deployment is less accurate than the full-precision one.
Deepseek’s mixed-precision architecture is a groundbreaking innovation that optimises AI model training and inference by combining different numerical precisions. This approach delivers significant benefits for both model performance and downstream application development. By using lower precision, mostly FP8, for most calculations, Deepseek reduces memory usage and computational load, enabling faster training and inference while maintaining model accuracy. Strategic use of higher precision for critical operations ensures that model performance remains robust and reliable. In short, it strikes a balance between efficiency and accuracy.
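For readers who have not used mixed precision before, the sketch below shows the general pattern in PyTorch: low-precision matrix maths inside an autocast context, with full-precision master weights and optimiser state. Deepseek’s FP8 scheme requires specialised kernels, so bfloat16 is used here purely as a stand-in to illustrate the idea.

```python
# General mixed-precision training pattern (bfloat16 as a stand-in for FP8):
# low-precision forward/backward maths, FP32 master weights and optimiser state.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # optimiser state stays in FP32

x = torch.randn(8, 1024, device=device)
target = torch.randn(8, 1024, device=device)

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    # Matrix multiplications run in low precision under autocast.
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = nn.functional.mse_loss(model(x), target)
    loss.backward()   # gradients flow back into the FP32 master weights
    optimizer.step()
```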
Most LLMs are released in FP-32 or FP-16, and developers have to either deploy them to a larger hardware profile or shrink them for a smaller one using a technique called quantisation. Deepseek models are released in FP-8, which means a 7B Deepseek model can be deployed to a consumer-grade GPU without noticeable performance degradation. That lets developers experiment on a lower budget, reach faster inference speeds for real-time applications, or achieve higher throughput with a reasonably sized cluster.
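As a rough illustration of the deployment side, the following sketch loads a distilled 7B checkpoint in 8-bit on a single consumer GPU using Hugging Face Transformers with bitsandbytes. Note that this uses INT8 weight quantisation rather than Deepseek’s native FP8 format, and the model id is only an example; treat it as a sketch of the footprint argument, not a reference deployment.

```python
# Low-footprint deployment sketch: 8-bit weights on a single consumer GPU.
# The model id is an example; substitute the checkpoint you actually intend to serve.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # INT8 weights
    device_map="auto",
)

prompt = "Explain mixed-precision training in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```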
Incredible RL-based Fine Tuning
The novel utilisation of RL-based fine tuning is another breakthrough.
Traditionally, techniques such as Supervised Fine-Tuning (SFT) have played a crucial role in improving model performance and domain-knowledge adaptation. SFT involves training a pre-trained model further on task-specific labeled datasets to refine its outputs. While effective in many applications, SFT is inherently a brute-force method: more data, longer training times, and greater computational demands. It also follows a pattern of diminishing returns, where merely increasing computational resources and data does not proportionally enhance performance. And that is before you consider the difficulty of collecting task-specific labeled data.
Unlike traditional fine-tuning methods that rely on static datasets, RL-based fine-tuning leverages dynamic feedback loops to refine model behavior, making it particularly powerful for complex, real-world applications. Specifically, it offers the following benefits (a minimal GRPO-style sketch follows the list):
- Dynamic Adaptation: RL-based fine-tuning allows models to learn from real-time feedback, enabling them to adapt to changing environments and user needs. This is especially valuable in applications like recommendation systems and autonomous systems, where conditions are constantly evolving.
- Task-Specific Optimization: By defining specific reward functions, developers can guide models to optimize for particular objectives, such as maximizing user engagement, minimizing errors, or improving efficiency. This targeted approach ensures that models perform exceptionally well in their intended tasks.
- Handling Complex Scenarios: RL excels in environments with sparse or delayed rewards, making it ideal for fine-tuning models in complex scenarios where traditional supervised learning struggles. For example, in robotics or strategic games, RL-based fine-tuning enables models to learn nuanced strategies over time.
- Continuous Improvement: Unlike one-time fine-tuning, RL-based methods enable continuous learning. Models can iteratively improve their performance as they interact with new data and environments, ensuring long-term relevance and accuracy.
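To ground this, here is a minimal sketch of the group-relative reward normalisation at the heart of GRPO-style fine-tuning: sample several completions per prompt, score them with a task-specific reward function, and weight each completion by how much better it is than its group’s average. The 0/1 rewards below are a toy assumption; a real pipeline would plug in verifiers, unit tests, or a reward model, and combine the advantages with a KL penalty against a reference model.

```python
# GRPO-style group-relative advantages: reward each sampled completion relative
# to the average of its own group, so no learned value function is needed.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) raw scores for sampled completions."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy example: 2 prompts, 4 sampled completions each, scored 0/1 by a verifier.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
advantages = group_relative_advantages(rewards)

# In training, each completion's token log-probabilities would be scaled by its
# advantage (plus a KL penalty against the reference model) to form the policy loss.
print(advantages)
```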
RAG has been widely recognised as a significant advancement in Generative AI technology. However, its lack of reasoning capability limits its ability to handle complex queries effectively. Similarly, agentic development relies on highly accurate, tunable reasoning LLMs. This is where Deepseek, with its robust reasoning capabilities, comes into play as an ideal complement. I envision a future where reasoning models like Deepseek seamlessly integrate with RAG and agents to tackle more sophisticated tasks.
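The shape of such an integration might look like the sketch below: a retrieval step feeding a reasoning-capable model that is asked to reason explicitly over the retrieved context. Both `retrieve` and `generate` are placeholder stubs standing in for your vector store and your model endpoint; this is a conceptual sketch, not a specific Deepseek or RAG-framework API.

```python
# Conceptual "reasoning RAG" sketch: retrieval feeds a reasoning-capable LLM.
# `retrieve` and `generate` are placeholders for your vector store and model endpoint.
from typing import List

def retrieve(query: str, k: int = 4) -> List[str]:
    """Placeholder: return the top-k chunks from your vector store."""
    return ["<retrieved chunk 1>", "<retrieved chunk 2>"][:k]

def generate(prompt: str) -> str:
    """Placeholder: call your reasoning model (e.g. a locally hosted distilled checkpoint)."""
    return "<model answer with step-by-step reasoning>"

def reasoning_rag(question: str) -> str:
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Use the context to answer the question. Reason step by step, and say "
        "explicitly if the context is insufficient.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return generate(prompt)

print(reasoning_rag("Why did revenue drop in Q3 according to the report?"))
```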
Related reading: “Disadvantages of RAG”, the first part of my RAG analysis (medium.com).
One feature I particularly admire is RL-based fine-tuning’s ability to improve continuously. This addresses a critical gap in current GenAI development, which largely lacks mechanisms for ongoing enhancement. From an application developer’s perspective, continuous improvement is essential for scaling a proof of concept into a fully fledged product. Deepseek’s approach not only addresses this need but also sets a new standard for building adaptable and scalable AI solutions.
High-Performance Team Redefined
How Deepseek has managed to catch up to — and even surpass — OpenAI’s top-performing models is phenomenal. What makes this even more astonishing is the contrast in team size: Deepseek operates with just 136 employees, compared to OpenAI’s 3,500. This isn’t an isolated case, either. History is filled with examples of small, nimble unicorn companies achieving extraordinary success against all odds:
- When Eric Schmidt stepped up as Google’s CEO in 2001, the company had fewer than 300 employees.
- Amazon, founded even earlier, had only 158 employees on the eve of its IPO in 1997.
- When WhatsApp was acquired for $19 billion in 2014, it had only 50 employees.
- When Instagram was sold for $1 billion in 2012, it had only 13 employees.
There is one thing we can be sure of: successful innovation requires a chain reaction of creativity within the team and a stroke of good fortune. But why are such companies often unable to maintain their initial momentum as they grow larger? And why do so many big companies fail, despite their ability to offer top salaries, attract the brightest talent, and access far greater resources?
These questions have sparked many fascinating discussions. I’d like to share a lesson I learned from my mentor at the start of my consulting career:
Larger corporations tend to have a lower collective IQ.
This may seem aggressive or even offensive, but it’s not what you might think; after a little refinement, the concept could serve as an icebreaker for a management-consulting engagement. While large companies often employ more intelligent individuals, their complex structures slow down the flow of information and knowledge, hinder cooperation, and make them less responsive to market and technical trends. This is what is meant by a “low-IQ enterprise.”
Wenfeng Liang, the CEO of Deepseek, shared in an interview that his company relies on self-organising teams. When a young engineer proposed a new idea about the model structure, a team automatically formed around him; the outcome was the very successful MLA, Multi-head Latent Attention. He also mentioned that the main meeting room in his company always has its door wide open, allowing anyone passing by to join the discussion if they have an idea. Does this sound like your company?
That’s the difference in enterprise IQ.
Don’t be discouraged if your company isn’t like this. Top-performing teams are rare, and most companies are not designed to stimulate that kind of chain reaction; it is hard to achieve in a small group and nearly impossible in a large company. Based on our discussion, it’s clear that Deepseek’s remarkable success as a small company isn’t an anomaly. And when it grows ten times larger, who knows, it could very well become just another average company.
A top-performing team is as rare as the chance to partner with an Olympic medalist. If you’re lucky enough to be part of one, don’t leave it for trivial temptations; you may never get another chance in your life to pour your heart into your work with such joy.
Parting Words
Deepseek is a milestone signalling that Generative AI is at a pivotal turning point, transitioning to a fundamentally different style of development and deployment. Whereas the engineer’s toolbox used to consist mainly of RAG and agents, the design and engineering of large language models themselves are now more accessible than ever, enabling integration of capabilities that were previously siloed. This shift has made LLM tuning and training significantly more available to application project teams, empowering them to tailor models to specific use cases. As a result, the barrier to entry for leveraging cutting-edge AI technologies has been lowered, opening up new opportunities for innovation across industries.
Looking ahead to 2025, my focus will be on diving deeper into Reinforcement Learning, a critical capability for the next generation of LLM fine-tuning and application building. Additionally, I plan to get hands-on with custom LLM tuning, data preparation, and hosting, ensuring I can build and deploy models that are both powerful and performant. By mastering these skills, I aim to be prepared for the next wave of AI-driven solutions.
Published via Towards AI