
Training Less, Achieving More: Unlocking Transformers with LoRA
Author(s): Saif Ali Kheraj
Originally published on Towards AI.
In the era of large language models, Transformers are like the original brain of AI. But they come with a catch: fully fine-tuning them is like …. Enter LoRA (Low-Rank Adaptation): "Hey, what if we only train the parts we really need?"
Think of LoRA as adding a tiny steering wheel to a giant spaceship. You don't need to rebuild the engine to change direction; you just bolt on a little adapter. In this article, we'll dive into the math, explain how LoRA works under the hood, and show where it fits in the Transformer architecture.
Let's say you have a neural network layer with:
Input size: d = 10
Output size: k = 8
The number of parameters in the weight matrix W0 is 10 × 8 = 80. That's fine for small models. But with models like GPT or BERT, we're talking millions of parameters, and training all of them is expensive, both in time and in your GPU's emotional well-being.
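To make that concrete, here is the same arithmetic as a tiny Python sketch; the 12,288 hidden size used for the large-model comparison is only an illustrative GPT-scale figure, not a number from the article:

```python
# Parameter count for the toy layer described above.
d, k = 10, 8
print(d * k)  # 80 weights in W0

# The same formula at an illustrative GPT-scale hidden size.
d_big = k_big = 12288
print(d_big * k_big)  # 150,994,944 -- roughly 151 million weights in a single matrix
```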
So LoRA says: "Freeze the big guy, train a tiny plug-in instead."
Normally, a neural layer does:
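As a quick sketch of that step (with x the input, h the output, and W0 the pretrained weight matrix; the bias is omitted, and the symbols x and h are mine, not the article's):

h = W0 x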
Now LoRA adds a twist.
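In equation form (again a sketch rather than the author's notation), LoRA keeps W0 frozen and adds a trainable update ΔW of the same shape (d × k):

h = W0 x + ΔW x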
But instead of making ΔW a full-sized matrix (which would defeat the purpose), we…
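Here is a minimal PyTorch sketch of the standard low-rank construction from the original LoRA paper, where ΔW is factored into two small matrices B and A with a rank r much smaller than d and k; the class name LoRALinear and the rank/alpha values below are illustrative choices, not the author's:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer with a frozen W0 plus a trainable low-rank update B @ A."""
    def __init__(self, d: int, k: int, r: int = 2, alpha: float = 4.0):
        super().__init__()
        # The big pretrained matrix W0 (d -> k): frozen, never updated during fine-tuning.
        self.W0 = nn.Linear(d, k, bias=False)
        self.W0.weight.requires_grad = False
        # The tiny trainable adapter: delta_W = B @ A, with rank r << min(d, k).
        self.A = nn.Parameter(torch.randn(r, d) * 0.01)  # r x d, small random init
        self.B = nn.Parameter(torch.zeros(k, r))         # k x r, zero init so delta_W starts at zero
        self.scale = alpha / r

    def forward(self, x):
        # h = W0 x + (B A) x, with the usual alpha / r scaling
        return self.W0(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d=10, k=8, r=2)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2*10 + 8*2 = 36 trainable weights, versus 80 frozen weights in W0
```

At this toy scale the savings are modest, but for a 12,288 × 12,288 matrix the same rank-2 adapter would train roughly 49 thousand weights instead of about 151 million.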