Deep Compression, 2015: How Much More Can We Squeeze in 2025?

Last Updated on December 29, 2025 by Editorial Team

Author(s): Vasyl Rakivnenko

Originally published on Towards AI.

Image generated with ChatGPT-5.2

It may be hard to believe, but compression of neural networks was already an important topic more than 35 years ago. Yann LeCun, in his paper Optimal Brain Damage, published at NeurIPS in 1989, argued that magnitude-based neuron scoring was too simplistic and instead introduced a much more advanced and efficient technique: computing the diagonal second derivatives [which LeCun himself had introduced earlier, in 1987]. This technique had the same order of complexity as computing the gradient, but was much better at identifying the parameters whose deletion would cause the smallest decrease in model accuracy. The paper not only discussed the diagonal Hessian, but also explained that after pruning, the network should be retrained, and that this procedure can be iterated, a technique (called iterative pruning) that is widely known and used today. As a result, LeCun and his team were able to compress their neural network by 30% without a decrease in accuracy while also significantly improving its speed. And that was back in !!! 1989 !!!

TL;DR: I revisited Han et al.’s Deep Compression (prune → retrain → quantize/weight-share → Huffman) on LeNet-300–100 and reproduced the classic result: keeping accuracy while achieving ~22× compression.

Then I introduced a TF-IDF-style activation-aware pruning score, and with simple tuning, it pushed compression well beyond the baseline: up to ~65× with the same ~96% accuracy after retraining.

Main Reasons to Compress AI Models

The most commonly cited reason for compressing models is to overcome the limitations of large, resource-intensive models, whose computational and memory requirements are high and constantly increasing. Compression enables running them on smaller devices, increases speed, and makes inference more cost-effective and energy-efficient.

But there’s another reason that makes compression interesting, and that I think we need to focus on: a small network trained from scratch has been shown to generalize worse than an equally sized network pruned from a larger pre-trained one. This means that if we want to get the best out of a 1B-parameter model, we can train a 2B-parameter model and then optimize it, compressing it to half of its initial parameter count, and it should [normally] be more effective than just training a 1B-parameter model.

The reason for this phenomenon is well explained in the paper “Loss landscapes and optimization in over-parameterized non-linear systems and neural networks,” and also in this YouTube video. In short, in over-parameterized models the minima form a manifold with no bad local minima (image below), so gradient methods can reach the best training loss quickly and efficiently.

Source: “Loss landscapes and optimization in over-parameterized non-linear systems and neural networks” https://arxiv.org/pdf/2003.00307

So, as we now know, it makes sense to train a larger model and then compress it up to the point where compression starts affecting the model's accuracy. Depending on the model architecture and how over-parameterized it is, it is possible to get anywhere from 2X to over 20X compression without dropping the model’s accuracy.

Interestingly, many people [and also Google Search — screenshot below] believe that this was proved in 2019 by the “Lottery Ticket Hypothesis” (LTH) paper by Frankle and Carbin.

My own screenshots from when I was trying to find the 1997 paper by Castellano et al.

But we know [and so does ChatGPT] that it was first demonstrated and explained in the 1997 paper “An Iterative Pruning Algorithm for Feedforward Neural Networks” by Castellano et al. Although it was not the main topic of that paper, it was the first time that training an over-parameterized model and then pruning it down was shown to be more effective than simply training a small model of the same target size from scratch.

Pruning vs Quantization

Before diving into experiments, I’d like to briefly discuss the difference between pruning and quantization. Until recent years, pruning weights was the most commonly used technique to reduce the number of parameters in a pre-trained DNN. The main idea of pruning was mentioned at the beginning of this blog: identify the set of parameters whose deletion will cause the smallest decrease in model performance and delete (mask) them, so the model does not spend computational resources, electricity, and time on them. The image below illustrates the difference between pruning individual weights and pruning entire neurons together with all their weights.

Source: Learning both Weights and Connections for Efficient Neural Networks, S. Han et al.
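The masking idea above can be sketched in a few lines of PyTorch. This is a minimal magnitude-based variant (a simplification for illustration, not the exact code used in the experiments below): build a 0/1 mask that zeroes the smallest-magnitude weights.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a 0/1 mask that removes the smallest-magnitude weights.

    `sparsity` is the fraction of weights to prune (e.g. 0.95).
    """
    k = int(weight.numel() * sparsity)  # number of weights to drop
    if k == 0:
        return torch.ones_like(weight)
    # threshold = k-th smallest absolute value in the layer
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

# toy fc2-sized layer, as in LeNet-300-100
w = torch.randn(100, 300)
mask = magnitude_prune(w, 0.95)
pruned = w * mask  # masked weights become exactly zero
print(f"{(pruned == 0).float().mean():.2%} of weights pruned")
```

In practice, the mask is kept alongside the weights so that retraining updates only the surviving connections.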

I believe it’s important to mention that nowadays, quantization has become a much more popular approach for DNN compression, as quantized models tend to be more accurate than pruned ones. This NeurIPS 2023 paper proved this point — quantization is almost always provably better than unstructured, semi-structured, and structured pruning.

Source: Pruning vs Quantization: Which is Better?, A. Kuzmin et al.

Quantization can be easily understood with the illustration below. It is a technique that reduces the precision of model parameters, decreasing the number of bits needed to store each one. For example, the 32-bit floating-point value 7.892345678 can be quantized to the 8-bit integer 8.

Source: https://www.digitalocean.com/community/tutorials/model-quantization-large-language-models
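As a rough sketch of the idea (uniform affine quantization, my own minimal illustration rather than any specific library's method), mapping float32 values onto the int8 grid looks like this:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Uniform affine quantization of float32 values to int8."""
    scale = (x.max() - x.min()) / 255.0  # width of one int8 step
    zero_point = np.round(-128 - x.min() / scale)
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float32 values."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.array([7.892345678, -1.5, 0.0, 3.25], dtype=np.float32)
q, scale, zp = quantize_int8(w)
w_hat = dequantize(q, scale, zp)
print(np.abs(w - w_hat).max())  # reconstruction error is within one quantization step
```

Each parameter now costs 8 bits instead of 32, at the price of a small, bounded rounding error.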

I think another reason quantization gained much wider adoption is that HuggingFace Transformers supports 19 quantization methods and not a single pruning method. It is worth mentioning some possible causes: pruning is a complementary method, and it can even be achieved by ‘radical’ quantization (quantizing weights to zero); also, most pruning methods, like magnitude-based pruning, APoZ, or WANDA, are not hard at all to implement from scratch, while quantization has more advanced methods and normally comes paired with another technique called weight sharing. The simplified idea behind weight sharing: instead of storing, for example, 400,000 parameters that all have the same value of 2, why not group (cluster) them and map all the corresponding weights to one stored value, saving the storage of 399,999 identical copies of that 2.
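A minimal sketch of weight sharing via 1-D k-means clustering, in the spirit of Deep Compression (the linear centroid initialization follows the paper; everything else here is simplified for illustration):

```python
import numpy as np

def weight_share(weights: np.ndarray, n_clusters: int = 16, iters: int = 20):
    """Cluster weight values; each weight is replaced by its centroid.

    Storing a 4-bit cluster index per weight plus a tiny codebook,
    instead of a 32-bit float per weight, is where the savings come from.
    """
    flat = weights.ravel()
    # linear initialization over the weight range
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)
    for _ in range(iters):
        # assign each weight to its nearest centroid
        idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        # move each centroid to the mean of its assigned weights
        for c in range(n_clusters):
            if np.any(idx == c):
                centroids[c] = flat[idx == c].mean()
    idx = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
    return idx.reshape(weights.shape).astype(np.uint8), centroids

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 300)).astype(np.float32)
idx, codebook = weight_share(w, n_clusters=16)
w_shared = codebook[idx]  # reconstructed weights: at most 16 distinct values
print("unique values after sharing:", np.unique(w_shared).size)
```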

Deep Compression

One of the most important papers that crystallized our modern Deep Neural Network compression pipeline was Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, Han et al. It was first published in October 2015 on arXiv by Song Han, Huizi Mao, and William J. Dally, later appeared as a conference paper at ICLR 2016, and currently has over 12,000 citations. If you don’t want to read the entire paper, you can get the overall idea from the illustration below and a much better understanding from the following experiments.

Source: Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding, Han et al.

For example, by applying their pruning → quantization → weight sharing → Huffman encoding pipeline, Song Han and his team achieved 40X compression of the LeNet-300–100 neural net without losing model accuracy; in fact, accuracy even slightly improved.

Source: Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, https://arxiv.org/abs/1510.00149
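The final Huffman stage exploits the skewed distribution of quantized cluster indices after pruning: frequent symbols get short bit codes. A toy sketch using Python's heapq (illustrative, not the repository's implementation):

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code (symbol -> bitstring) from a list of symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate single-symbol case
        return {next(iter(freq)): "0"}
    # heap entries: (frequency, tiebreaker, {symbol: code_so_far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)  # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# a skewed index stream, like the one produced after pruning + weight sharing
stream = [0] * 90 + [1] * 6 + [2] * 3 + [3]
code = huffman_code(stream)
bits = sum(len(code[s]) for s in stream)
print(f"{bits} bits vs {len(stream) * 2} bits fixed-width")
```

Because the coding is lossless, this stage shrinks storage without touching accuracy at all.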

So, inspired by how much I enjoyed my previous post [A 1989 ConvNet: What’s Changed Since Karpathy Updated LeCun’s 33-Year-Old Code 3 Years Ago], I decided to try something similar with this masterpiece, Deep Compression, and see whether, and if so by how much more, I could compress the same neural net now, 10 years later.

Running the Original Implementation

As I was more interested in building on the Deep Compression work to see whether I could push beyond the 40X compression rate, I decided to stick to one of the best practices in software development: reusing code [in this case, someone else’s code 😅 ]. I found a great implementation of the paper by mightydeveloper and forked it (big thanks to him!).

After running the pruning.py file, the initial model was:

--- Before pruning ---
fc1.weight | nonzeros = 235200 / 235200 (100.00%) | shape = (300, 784)
fc1.bias | nonzeros = 300 / 300 (100.00%) | shape = (300,)
fc2.weight | nonzeros = 29999 / 30000 (100.00%) | shape = (100, 300)
fc2.bias | nonzeros = 100 / 100 (100.00%) | shape = (100,)
fc3.weight | nonzeros = 1000 / 1000 (100.00%) | shape = (10, 100)
fc3.bias | nonzeros = 10 / 10 (100.00%) | shape = (10,)

After 100 epochs, accuracy was 95%, but honestly, the loss was barely improving after around the 50th epoch, so for my next experiments I decided to cut training in half, from 100 to 50 epochs.

Train Epoch: 99 [59500/60000 ( 99%)] Loss: 0.467588: 100%|██████████████████| 1200/1200 [00:03<00:00, 333.59it/s]
Test set: Average loss: 0.2164, Accuracy: 9451/10000 (94.51%)

After pruning was completed, test validation showed that the model’s accuracy dropped by 23 percentage points, down to 72%:

Test set: Average loss: 1.4825, Accuracy: 7164/10000 (71.64%)

But we removed (masked) 95% of the model’s weights, achieving a 22X compression rate after just the first step.

--- After pruning ---
fc1.weight | nonzeros = 10483 / 235200 ( 4.46%) | total_pruned = 224717 | shape = (300, 784)
fc1.bias | nonzeros = 300 / 300 (100.00%) | total_pruned = 0 | shape = (300,)
fc2.weight | nonzeros = 1396 / 30000 ( 4.65%) | total_pruned = 28604 | shape = (100, 300)
fc2.bias | nonzeros = 100 / 100 (100.00%) | total_pruned = 0 | shape = (100,)
fc3.weight | nonzeros = 71 / 1000 ( 7.10%) | total_pruned = 929 | shape = (10, 100)
fc3.bias | nonzeros = 10 / 10 (100.00%) | total_pruned = 0 | shape = (10,)
alive: 12360, pruned : 254250, total: 266610, Compression rate : 21.57x ( 95.36% pruned)
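The compression rate in the log above is simply total parameters divided by surviving (nonzero) parameters, and the arithmetic checks out:

```python
# nonzero counts per tensor, copied from the pruning log above
alive = 10483 + 300 + 1396 + 100 + 71 + 10
total = 235200 + 300 + 30000 + 100 + 1000 + 10
rate = total / alive
pruned_pct = 100 * (total - alive) / total
print(f"Compression rate: {rate:.2f}x ({pruned_pct:.2f}% pruned)")
```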

Now we can clearly see that [aggressive] pruning not only reduces storage and increases the model’s speed, but also affects the model’s main performance metric: accuracy. Following the paper’s suggestion, as implemented in the forked repository, the next step is retraining the model for another 100 epochs. The result: accuracy got back to 96%.

Train Epoch: 99 [59500/60000 ( 99%)] Loss: 0.183090: 100%|██████████████████| 1200/1200 [00:03<00:00, 334.16it/s]
Test set: Average loss: 0.1332, Accuracy: 9620/10000 (96.20%)

So after our first step, pruning plus retraining, we now have a model with only 4.64% of its parameters remaining (functioning) and a 22X compression rate at the same level of accuracy.

Weight sharing & Huffman Code

After running weight_share.py and huffman_encode.py, I got 96.26% accuracy. This is expected, as Huffman coding is lossless, and weight sharing can be lossy or lossless depending on the model's complexity. Since we‘re using what is a toy-size model by modern standards, with models now reaching trillions of parameters, it’s not surprising to me that applying both algorithms resulted in no accuracy loss.

accuracy before weight sharing
Test set: Average loss: 0.1332, Accuracy: 9620/10000 (96.20%)
accuracy after weight sharing
Test set: Average loss: 0.1355, Accuracy: 9626/10000 (96.26%)

So now, it was time to find a better way to compress it even more. And the best possible way to do that for me was to try to invent (or re-invent without knowing it existed) a new approach…

My Prominence-based Scoring Method

For some time, I had been thinking that if I could figure out an efficient way to identify the most informative parameters, I’d have a more efficient way to compress models. By efficient, I mean without heavy math like computing Taylor approximations or Hessian diagonals. This would let me prune the least useful parameters without spending hours (for models like Llama-7B) or even days of GPU time (for heavier models with tens of billions of parameters).

The logic behind my method is that if a channel (it might even be applicable at the neuron level) fires strongly for SOME classes and weakly for OTHERS, then it helps distinguish between categories, so we want to keep it.

Example:
Class "dog": activation = 8.5
Class "cat": activation = 1.2
Class "truck": activation = 0.3
→ Variance is HIGH → Dog detector → USEFUL

If a channel fires for all classes, it is not selective and may be too much of a generalist; it does not contribute to distinguishing between classes, so we can safely prune it.

Example:
Class "dog": activation = 7.8
Class "cat": activation = 7.5
Class "truck": activation = 8.1
→ Variance is LOW → Responds to everything → USELESS

Lastly, a channel that fires weakly (or not at all) for all (or most) classes is not contributing to our model’s accuracy either, so we want to prune it as well.

Example:
Class "dog": activation = 0.1
Class "cat": activation = 0.2
Class "truck": activation = 0.1
→ Mean is LOW → Dead neuron → USELESS
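The three cases above boil down to two statistics per channel: the mean and the variance of its per-class activations. A toy sketch (the thresholds here are purely illustrative, not the values used in my experiments):

```python
import numpy as np

def classify_channel(per_class_activations, var_thresh=1.0, mean_thresh=0.5):
    """Classify a channel from its average activation per class."""
    a = np.asarray(per_class_activations, dtype=float)
    if a.mean() < mean_thresh:
        return "dead"            # barely fires for anything -> prune
    if a.var() < var_thresh:
        return "generalist"      # fires for everything, not selective -> prune
    return "discriminative"      # fires for some classes only -> keep

print(classify_channel([8.5, 1.2, 0.3]))   # the "dog detector" case
print(classify_channel([7.8, 7.5, 8.1]))   # responds to everything
print(classify_channel([0.1, 0.2, 0.1]))   # dead neuron
```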

During a short pass over the training data, forward hooks collect per‑neuron statistics on our masked layers. The algorithm then keeps the magnitude-based (MB) term but multiplies it by TF‑IDF-style activation statistics:

score = |w|^weight_power × TF_component × IDF_component

The final score is the elementwise product: `score = weight_component × tf_component × idf_component`, masked to zero for already-pruned weights. Adjusting the TF/IDF hyperparameters lets us emphasise frequent, discriminative neurons relative to pure magnitude.
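As a sketch of how such a score can be computed (the parameter names and the exact TF/IDF formulas here are my illustrative assumptions, not the repository's exact code):

```python
import numpy as np

def tfidf_score(weights, act_freq, act_strength,
                weight_power=1.0, tf_smooth=1.0, idf_add=1.0):
    """TF-IDF-style pruning score for a (out, in) weight matrix.

    act_strength : mean activation magnitude per input neuron ("term frequency")
    act_freq     : fraction of samples on which each input neuron fires
                   ("document frequency")
    """
    tf = np.log1p(act_strength / tf_smooth)          # strong activations boost the score
    idf = np.log(idf_add + 1.0 / (act_freq + 1e-8))  # rare activations boost it further
    # broadcast per-input statistics across every outgoing weight
    return (np.abs(weights) ** weight_power) * tf[None, :] * idf[None, :]

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 300))
act_strength = rng.uniform(0, 5, size=300)   # collected by forward hooks in practice
act_freq = rng.uniform(0.01, 1.0, size=300)
scores = tfidf_score(w, act_freq, act_strength)
print(scores.shape)
```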

Threshold and Prune

Scores are compared against a threshold determined either globally or per layer:

  • Percentile mode: prune the lowest‑scoring X%.
  • Sensitivity mode: prune scores below σ × sensitivity, where σ is the score standard deviation.

Connections with scores below the chosen threshold have their weights and mask entries zeroed, effectively removing them from the model.
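Both threshold modes can be sketched as follows (a simplified illustration, with plain magnitude standing in for the real score):

```python
import numpy as np

def prune_by_score(weights, scores, mode="percentile",
                   percentile=95.0, sensitivity=2.5):
    """Zero out weights whose score falls below the chosen threshold."""
    alive = weights != 0
    if mode == "percentile":
        # prune the lowest-scoring X% of still-alive connections
        threshold = np.percentile(scores[alive], percentile)
    else:
        # sensitivity mode: threshold = score std-dev times sensitivity
        threshold = scores[alive].std() * sensitivity
    mask = (scores >= threshold) & alive
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(100, 300))
scores = np.abs(w)  # stand-in score for the demo
pruned, mask = prune_by_score(w, scores, mode="percentile", percentile=95.0)
print(f"{(~mask).mean():.0%} pruned")
```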

This TF‑IDF‑like approach favors weights fed by rare-but-informative activations, allowing more data-aware pruning than plain magnitude-based methods. Adjusting the activation threshold, smoothing, additive constant, and exponent parameters lets you explore different weighting schemes and pruning aggressiveness.

Experiments and results

The first run showed me this was a promising approach. My first optimization tweak was increasing the sensitivity to 2.2, and how happy I was to see the results! A 25x compression rate and 87.45% accuracy before retraining (vs. 72% with the initial approach!). It was clear to me: I'd made it! After an additional 100 epochs, I got the same 96% accuracy as with the initial approach, but with an additional 15.6% of compression. And I felt I could squeeze out more!

Train Epoch: 99 [59500/60000 ( 99%)] Loss: 0.195088: 100%|██████████████████| 1200/1200 [00:03<00:00, 338.79it/s]
Test set: Average loss: 0.1350, Accuracy: 9613/10000 (96.13%)
--- After Retraining ---
fc1.weight | nonzeros = 9695 / 235200 ( 4.12%) | total_pruned = 225505 | shape = (300, 784)
fc1.bias | nonzeros = 300 / 300 (100.00%) | total_pruned = 0 | shape = (300,)
fc2.weight | nonzeros = 544 / 30000 ( 1.81%) | total_pruned = 29456 | shape = (100, 300)
fc2.bias | nonzeros = 100 / 100 (100.00%) | total_pruned = 0 | shape = (100,)
fc3.weight | nonzeros = 41 / 1000 ( 4.10%) | total_pruned = 959 | shape = (10, 100)
fc3.bias | nonzeros = 10 / 10 (100.00%) | total_pruned = 0 | shape = (10,)
alive: 10690, pruned : 255920, total: 266610, Compression rate : 24.94x ( 95.99% pruned)

My next move was increasing the sensitivity to 2.5, and again I wasn’t disappointed. I got 30x compression with the same accuracy right after pruning as with a sensitivity of 2.2.

Test set: Average loss: 0.6171, Accuracy: 8746/10000 (87.46%)
--- After pruning ---
fc1.weight | nonzeros = 7833 / 235200 ( 3.33%) | total_pruned = 227367 | shape = (300, 784)
fc1.bias | nonzeros = 300 / 300 (100.00%) | total_pruned = 0 | shape = (300,)
fc2.weight | nonzeros = 492 / 30000 ( 1.64%) | total_pruned = 29508 | shape = (100, 300)
fc2.bias | nonzeros = 100 / 100 (100.00%) | total_pruned = 0 | shape = (100,)
fc3.weight | nonzeros = 37 / 1000 ( 3.70%) | total_pruned = 963 | shape = (10, 100)
fc3.bias | nonzeros = 10 / 10 (100.00%) | total_pruned = 0 | shape = (10,)
alive: 8772, pruned : 257838, total: 266610, Compression rate : 30.39x ( 96.71% pruned)

And after retraining the model for the same 100 epochs, accuracy was back at 96%, now at 30x compression: an additional 41% over the initial approach, which produced 22x compression at this stage. Pushing the sensitivity further, to 2.7, had almost no impact on accuracy but gave 34x compression, and a sensitivity of 3 gave even better results. This felt unreal, but a sensitivity of 4 gave us 65X compression with 95.83% accuracy:

Train Epoch: 99 [59500/60000 ( 99%)] Loss: 0.110744: 100%|██████████████████| 1200/1200 [00:03<00:00, 333.04it/s]
Test set: Average loss: 0.1508, Accuracy: 9583/10000 (95.83%)
--- After Retraining ---
fc1.weight | nonzeros = 3330 / 235200 ( 1.42%) | total_pruned = 231870 | shape = (300, 784)
fc1.bias | nonzeros = 300 / 300 (100.00%) | total_pruned = 0 | shape = (300,)
fc2.weight | nonzeros = 313 / 30000 ( 1.04%) | total_pruned = 29687 | shape = (100, 300)
fc2.bias | nonzeros = 100 / 100 (100.00%) | total_pruned = 0 | shape = (100,)
fc3.weight | nonzeros = 20 / 1000 ( 2.00%) | total_pruned = 980 | shape = (10, 100)
fc3.bias | nonzeros = 10 / 10 (100.00%) | total_pruned = 0 | shape = (10,)
alive: 4073, pruned : 262537, total: 266610, Compression rate : 65.46x ( 98.47% pruned)

After getting these crazy results, I decided to google whether I had re-invented an algorithm that was perhaps used half a century ago, or whether it was something new. After a few minutes of googling, I decided to bring in the heavy artillery and asked Gemini, ChatGPT, and Claude to conduct deep research. After reviewing more than 200 sources, they concluded that it was a genuinely novel approach. See the screenshots below:

I don’t actually think I invented something valuable for the current level of complexity that our SOTA LLMs have, but I believe this TF-IDF-inspired approach should be studied for other applications — for compression of more complex models, maybe as a regularization enhancement, or perhaps for de-biasing LLMs.

Then I ran some more experiments with further tweaks, benchmarking my approach (marked as NeuronRank on the graph) against the APoZ, Fisher, and Taylor methods. You can see the results below.

It looks pretty good, considering it is far less computationally heavy than methods that use first- or second-order information.

Conclusion

Deep Compression still holds up: prune → retrain → quantize/weight-share → Huffman reliably recovers accuracy while delivering large compression. Reproducing it confirmed the key lesson — retraining is what makes aggressive pruning workable.

The main new result I was able to come up with was that a lightweight, TF-IDF-style activation-aware scoring can preserve more accuracy right after pruning and push compression further on this toy model (up to ~65× with minimal accuracy loss after retraining).

Next steps: compare against WANDA and other strong baselines, and validate at scale on modern architectures to see whether the same ‘rare-but-informative activations’ signal remains useful there.

You can find the full code and the TF-IDF-inspired algorithm on my GitHub (folder NET). If you do something similar — I’d really love to hear about it. Let’s connect on LinkedIn!


Published via Towards AI

