
How Are LLMs Trained: For Engineers
Author(s): Harsh Chandekar
Originally published on Towards AI.
In late 2022, ChatGPT burst onto the scene, igniting a global frenzy around artificial intelligence. This remarkable success, however, brought to light a significant challenge: the immense computational demands of large language models (LLMs) during both training and deployment. Imagine trying to build an entire city, block by block, with only a handful of construction workers and limited resources. That’s a bit like training and running today’s colossal LLMs without smart optimization strategies.
The Herculean Task of Training LLMs
Training an LLM can be compared to teaching a student everything there is to know about the world from an enormous digital library. This process is incredibly resource-intensive, with model parameters growing tenfold each year, sometimes requiring months of training on thousands of GPUs.
Before we delve into the solutions, it’s crucial to understand a core concept: knowledge in LLMs.
Knowledge refers to the information and patterns encoded in the model’s parameters based on the data it was trained on.
Researchers commonly differentiate between two types:
Worldly Knowledge:
This is akin to human knowledge, where an agent draws on an internal “world model” that structurally matches the real world, allowing it to perceive, reason, plan, and reliably generate truth-preserving, relevant content about it. Think of it as truly understanding the underlying facts and relationships.
Examples:
- Paris is the capital of France.
- Water boils at 100°C at sea level.
LLMs learn a vast amount of this type of knowledge during training by reading books, articles, Wikipedia, and web data.
Instrumental Knowledge:
This refers to an LLM’s ability to use its next-word generation capability as a tool to perform tasks, often by spontaneously inferring task structure. LLMs, trained primarily on text, acquire instrumental knowledge, and the extent to which this relies on true worldly understanding is a key area of exploration.
Examples:
- How to write a Python function.
- How to cook rice.
- How to write a business email.
LLMs learn patterns and procedures, allowing them to generate code, write essays, and perform multi-step reasoning.
With that distinction in place, let’s turn to the optimization techniques.
How Do We Optimize Training?
1. Data I/O Optimization: Feeding the Beast Efficiently
Imagine collecting an astronomical number of books (data) for your LLM to read. If these books are scattered, poorly organized, or stored on slow systems, the training process grinds to a halt. This is the challenge of data I/O.
Efficient data I/O involves:
- Handling Massive Data: LLMs are trained on terabyte- to petabyte-scale corpora, requiring efficient reading and preprocessing.
- Overcoming Storage/Bandwidth Limits: Traditional storage systems can become bottlenecks, causing delays in data reading.
Think of a massive library where books (data) are scattered everywhere, and you need to find specific information quickly for your project (training). Efficient data I/O is like having a perfectly organized digital catalog and high-speed robots to fetch books instantly, rather than rummaging through dusty shelves.
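As a concrete illustration (not from the original article), here is a minimal PyTorch sketch of streaming pre-tokenized shards through memory-mapped files and a multi-worker DataLoader; the shard path, sequence length, and batch size are hypothetical placeholders.

```python
import glob
import numpy as np
import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class ShardedTokenDataset(IterableDataset):
    """Streams fixed-length token windows from pre-tokenized .npy shards.

    Memory-mapping avoids loading whole shards into RAM, and multiple
    DataLoader workers keep reading ahead while the GPU trains on
    earlier batches."""

    def __init__(self, shard_glob: str, seq_len: int = 2048):
        self.shard_paths = sorted(glob.glob(shard_glob))
        self.seq_len = seq_len

    def __iter__(self):
        worker = get_worker_info()
        paths = self.shard_paths
        if worker is not None:  # give each DataLoader worker its own shards
            paths = paths[worker.id::worker.num_workers]
        for path in paths:
            tokens = np.load(path, mmap_mode="r")  # memory-mapped, not fully loaded
            for i in range(len(tokens) // self.seq_len):
                window = tokens[i * self.seq_len:(i + 1) * self.seq_len]
                yield torch.from_numpy(np.array(window, dtype=np.int64))

loader = DataLoader(
    ShardedTokenDataset("data/shards/*.npy"),  # hypothetical shard location
    batch_size=8,
    num_workers=4,       # overlap data reading with GPU compute
    prefetch_factor=2,   # each worker keeps batches ready in advance
    pin_memory=True,     # faster host-to-GPU copies
)
```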
2. Memory Optimization: Taming the Data Dragon
LLMs contain billions, even trillions, of parameters. Each of these parameters, along with intermediate calculations (activations) and optimizer states, demands significant memory.
Key techniques include:
- Gradient Checkpointing: This technique balances memory efficiency against computational demands. Instead of storing all intermediate activation values during the forward pass (which are needed in the backward pass to calculate gradients), it recomputes them when needed. This saves memory at the cost of some additional computation (a minimal sketch follows this list).
- ZeRO (Zero Redundancy Optimizer) Series: Developed by Microsoft as part of the DeepSpeed library, ZeRO is a suite of techniques designed to drastically reduce memory redundancy and optimize resource utilization for training very large models.
- ZeRO-1: Partitions the optimizer states across GPUs, so each GPU only stores a portion.
- ZeRO-2: Partitions optimizer states and gradients.
- ZeRO-3: Further partitions model parameters, in addition to optimizer states and gradients, ensuring no redundancy across GPUs.
- ZeRO++: Builds on previous versions, focusing on improving overall efficiency, scalability, and ease of use, including supporting hybrid parallelism.
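To make the two ideas above concrete, here is a minimal, hedged sketch: a PyTorch module that checkpoints each block with torch.utils.checkpoint, followed by an illustrative DeepSpeed ZeRO-3 configuration dictionary. The layer sizes, batch size, and offload choice are placeholder values, not a recommended setup.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedBlockStack(nn.Module):
    """Recomputes each block's activations in the backward pass instead of
    storing them, trading extra compute for a much smaller activation footprint."""

    def __init__(self, dim: int = 1024, depth: int = 12):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(depth)
        ])

    def forward(self, x):
        for block in self.blocks:
            # Activations inside `block` are discarded after the forward pass
            # and recomputed during the backward pass.
            x = checkpoint(block, x, use_reentrant=False)
        return x

# An illustrative DeepSpeed ZeRO-3 configuration (values are placeholders):
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                               # partition optimizer states, gradients, and parameters
        "offload_optimizer": {"device": "cpu"},   # optionally push optimizer states to CPU RAM
    },
}
# model_engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config, ...)
```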
3. Parallelism Optimization: Many Hands Make Light Work
Training a colossal LLM often requires distributing the workload across hundreds or thousands of GPUs. This is where parallelism comes in.
3D Parallelism: This is a comprehensive strategy that integrates three main forms of parallelism to train large-scale models efficiently:
- Data Parallelism: The simplest form, where each GPU holds a complete copy of the model and different batches of data are processed in parallel. Gradients are then synchronized and averaged across all GPUs (a minimal DDP sketch appears at the end of this section).

- Tensor Parallelism: The model’s tensors (like weight matrices) are split across multiple GPUs. This reduces the memory footprint of the model on any single GPU but requires frequent communication of intermediate results during forward and backward passes.

- Pipeline Parallelism: The model’s layers are divided into stages, and each stage is assigned to a different GPU or group of GPUs. Data flows through these stages in a pipeline, improving throughput.

Think of training a giant model like building a car on an assembly line. Data parallelism is like having multiple identical assembly lines, each building a complete car. Tensor parallelism is like having different teams on one line, each focusing on a specific part (engine, chassis, etc.). Pipeline parallelism is like dividing the car-building process into stages, with each stage handled by a dedicated team, and cars moving sequentially through these stages.
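As a concrete example of the simplest of the three, here is a minimal data-parallel training sketch using PyTorch’s DistributedDataParallel, assuming a (inputs, labels)-style dataset and a torchrun launch; tensor and pipeline parallelism are usually provided by frameworks such as Megatron-LM or DeepSpeed rather than hand-rolled.

```python
import os
import torch
import torch.distributed as dist
import torch.nn.functional as F
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def train_data_parallel(model_fn, dataset, epochs=1):
    """Plain data parallelism: every GPU holds a full model replica, processes
    its own slice of each batch, and DDP all-reduces (averages) gradients so
    all replicas take identical optimizer steps. Launch with `torchrun`."""
    dist.init_process_group("nccl")                 # one process per GPU
    local_rank = int(os.environ["LOCAL_RANK"])      # set by torchrun
    torch.cuda.set_device(local_rank)

    model = DDP(model_fn().cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    sampler = DistributedSampler(dataset)           # each rank sees a disjoint shard
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    for epoch in range(epochs):
        sampler.set_epoch(epoch)                    # reshuffle differently each epoch
        for inputs, labels in loader:
            logits = model(inputs.cuda(local_rank))
            loss = F.cross_entropy(logits, labels.cuda(local_rank))
            loss.backward()                         # gradients averaged across GPUs here
            optimizer.step()
            optimizer.zero_grad()

    dist.destroy_process_group()
```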
4. Parameter-Efficient Learning: Smart, Not Just Big
Traditional training updates all model parameters. Parameter-efficient learning aims to reduce the number of parameters that need to be updated or stored, leading to faster training, lower memory consumption, and more efficient inference.
- Low-Rank Factorization: This technique decomposes large weight matrices into products of smaller, lower-rank matrices, reducing the total number of parameters while preserving most of the model’s expressiveness. The same low-rank idea underpins LoRA, discussed under PEFT below.
- Knowledge Distillation (KD): This involves training a smaller “student” model to mimic the behavior of a larger “teacher” model. The student learns to approximate the teacher’s output, achieving comparable performance with fewer parameters.

- Black-box distillation is common, where the student model learns from the teacher’s outputs without needing access to its internal structure. Many instruction-tuned models, for example, are fine-tuned using data generated by powerful models like ChatGPT or GPT-4.
- Chain-of-Thought (CoT) distillation is a black-box approach where LLMs generate intermediate reasoning steps to guide smaller models in tackling complex tasks.
- In-Context Learning (ICL) distillation uses structured prompts with task descriptions and examples to enable the student model to learn new tasks by imitating the teacher’s predictions and ground truth values.
- Parameter-Efficient Fine-Tuning (PEFT): Instead of fine-tuning the entire model, PEFT methods only update a small subset of parameters while keeping the rest fixed. Adapter modules are a prime example, adding small, task-specific layers to a pre-trained model. LoRA (Low-Rank Adaptation) is a popular PEFT technique that has been applied in studies, such as fine-tuning OLMo-7b-Instruct on synthetic datasets.
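To illustrate both the low-rank idea and PEFT, here is a minimal from-scratch LoRA wrapper around a frozen linear layer; it is a sketch for intuition, not the peft library’s actual implementation, and the rank, alpha, and layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update:
    y = W x + (alpha / r) * B(A x), where A and B are small rank-r matrices."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # frozen pre-trained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Linear(base.in_features, r, bias=False)
        self.lora_B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)        # start as a no-op update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_B(self.lora_A(x))

# Only the LoRA matrices are trained: a tiny fraction of the original parameters.
layer = LoRALinear(nn.Linear(4096, 4096), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable:,} / {total:,}")
```

Running the snippet shows that only a small fraction of the layer’s parameters require gradients, which is exactly where the training-time savings come from.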
Insights from Knowledge Acquisition: Studies show that LLMs acquire most of their factual knowledge during the pre-training phase. Fine-tuning, especially with new or “unknown” factual information, can be problematic: models learn new knowledge more slowly through fine-tuning, and it can actually increase their tendency to “hallucinate” (generate factually incorrect responses) with respect to their pre-existing knowledge. This suggests that fine-tuning is less about injecting new facts and more about teaching the model to use the knowledge it already has. Interestingly, fine-tuning on “MaybeKnown” examples (facts the model holds with lower certainty) can be more beneficial for overall performance than fine-tuning solely on “HighlyKnown” ones, implying that varied certainty levels help the model handle diverse scenarios during inference.
The frequency of causal mentions in pre-training corpora has also been found to positively correlate with LLM performance in causal discovery tasks. Conversely, the presence and increased frequency of “anti-causal” relations (e.g., “effect causes cause”) in the training data can decrease an LLM’s confidence in correct causal relations.
5. Gradient Perspectives: The Mechanics of Learning
A deeper understanding of LLM training comes from analyzing gradients, which reflect how the model adjusts its parameters during learning. Researchers found that “slow thinking” (training with detailed Chain-of-Thought, or CoT) leads to more stable gradient norms across different layers of the LLM, meaning the learning process is smoother and less volatile. In contrast, “fast thinking” (training without CoT or with simplified CoT) results in larger and more drastic gradient differences across layers, potentially indicating instability.

Moreover, slow thinking allows LLMs to distinguish between correct and irrelevant reasoning paths, something fast thinking struggles with. This ability isn’t simply due to longer response lengths; merely increasing response length in knowledge-learning tasks doesn’t produce the same stable gradient patterns as detailed CoT. Additionally, learning unpopular knowledge (less frequently encountered data) triggers larger gradients, suggesting the model needs to exert more effort to integrate this information.
Think of an athlete training. If they just learn the final move (fast thinking), they might get it done, but their technique might be shaky (unstable gradients). If they meticulously practice every step of the movement, breaking it down (slow thinking/CoT), their muscle memory becomes solid, and their form is consistent (stable gradients). This also helps them distinguish good moves from bad ones.
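For readers who want to probe this themselves, here is a minimal sketch of how per-layer gradient norms can be logged during training; it is a generic PyTorch utility, not the instrumentation used in the cited studies.

```python
import torch

def per_layer_grad_norms(model: torch.nn.Module):
    """Collects the L2 norm of each parameter's gradient after loss.backward().

    Comparing these norms across layers, and across training styles (e.g. with
    vs. without detailed CoT targets), shows how evenly the learning signal is
    distributed through the network."""
    norms = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            norms[name] = param.grad.detach().norm(p=2).item()
    return norms

# Usage, after one forward + backward pass on a batch:
# loss = compute_loss(model, batch)
# loss.backward()
# print(per_layer_grad_norms(model))
```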
Optimizing for Inference: Delivering the AI Magic
Once an LLM is trained, it needs to be deployed for use, a process called inference. Inference carries its own significant computational costs, from latency (how long it takes to get a response) to throughput (how many requests can be served per second).
1. Model Compression: Making Models Leaner
To make LLMs suitable for real-world applications, especially on devices with limited resources, their size needs to be reduced.
- Quantization: This technique reduces the numerical precision of model parameters (e.g., from 32- or 16-bit floating point down to 8-bit or even 4-bit integers). This significantly cuts memory usage and improves computational efficiency with minimal loss in accuracy (a simplified sketch follows this list).
- Sparsity/Pruning: This involves identifying and removing less important weights or connections within the neural network. Techniques like LLM-Pruner and Wanda can compress models with negligible accuracy loss.
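Below is a deliberately simplified sketch of both ideas: symmetric per-tensor int8 quantization and unstructured magnitude pruning. Production systems use calibrated or quantization-aware schemes (e.g., GPTQ, AWQ) and more sophisticated pruning criteria such as those in LLM-Pruner and Wanda; the matrix size and sparsity level here are arbitrary.

```python
import torch

def int8_quantize(w: torch.Tensor):
    """Symmetric per-tensor int8 quantization: store int8 weights plus one scale."""
    scale = w.abs().max() / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def int8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

def magnitude_prune(w: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zeroes out the smallest-magnitude weights (unstructured pruning)."""
    k = int(w.numel() * sparsity)
    if k == 0:
        return w.clone()
    threshold = w.abs().flatten().kthvalue(k).values
    return torch.where(w.abs() > threshold, w, torch.zeros_like(w))

w = torch.randn(4096, 4096)
q, scale = int8_quantize(w)
print("mean quantization error:", (int8_dequantize(q, scale) - w).abs().mean().item())
print("achieved sparsity:", (magnitude_prune(w, 0.5) == 0).float().mean().item())
```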

2. Computation Optimization: Speeding Up Calculations
Optimizing the actual computations during inference is crucial for reducing latency.
- Attention Mechanisms: Variations of the attention mechanism, such as Multi-Head Attention (MHA), Grouped-Query Attention (GQA), and Multi-Query Attention (MQA), offer different trade-offs between memory usage (chiefly the size of the key-value cache at inference time) and computational speed, allowing engineers to select the best fit for their application.
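As a rough illustration of that trade-off, here is a toy grouped-query attention forward pass: setting n_kv_heads equal to n_heads recovers MHA, while setting it to 1 gives MQA. The projection layers and dimensions are made up for the example.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_heads: int, n_kv_heads: int):
    """Toy GQA: n_kv_heads key/value heads are shared by groups of query heads.
    n_kv_heads == n_heads gives MHA; n_kv_heads == 1 gives MQA."""
    B, T, D = q.shape
    head_dim = D // n_heads
    q = q.view(B, T, n_heads, head_dim).transpose(1, 2)      # (B, n_heads, T, hd)
    k = k.view(B, T, n_kv_heads, head_dim).transpose(1, 2)   # (B, n_kv_heads, T, hd)
    v = v.view(B, T, n_kv_heads, head_dim).transpose(1, 2)

    group = n_heads // n_kv_heads
    k = k.repeat_interleave(group, dim=1)                    # share each KV head
    v = v.repeat_interleave(group, dim=1)                    # across `group` query heads

    out = F.scaled_dot_product_attention(q, k, v)            # (B, n_heads, T, hd)
    return out.transpose(1, 2).reshape(B, T, D)

# Fewer KV heads means a proportionally smaller KV cache at inference time.
x = torch.randn(1, 16, 512)
q_proj = torch.nn.Linear(512, 512)       # 8 query heads of dim 64
kv_dim = 512 // 8 * 2                    # only 2 KV heads of dim 64
k_proj = torch.nn.Linear(512, kv_dim)
v_proj = torch.nn.Linear(512, kv_dim)
y = grouped_query_attention(q_proj(x), k_proj(x), v_proj(x), n_heads=8, n_kv_heads=2)
```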
3. Decoding Optimization: Crafting Responses Faster
The way an LLM generates its output, token by token, is called decoding.
- Greedy Decoding: At each step, the model simply picks the most likely next token. It’s fast but can sometimes lead to suboptimal or repetitive outputs (a minimal loop is sketched in code after the analogy below).
- Beam Search: This explores multiple possible sequences in parallel, keeping track of several “best” options. It generally produces higher-quality outputs than greedy decoding but is computationally more expensive.
- Speculative Decoding: This innovative technique uses a smaller, faster “draft” model to predict several upcoming tokens. The main, larger LLM then quickly verifies these predictions in parallel. If verified, these tokens are accepted; if not, the larger model corrects them. This can significantly speed up inference by avoiding step-by-step generation by the large model for every token.

Imagine an LLM trying to write a complex story. Greedy decoding is like writing word-by-word, always picking the most obvious next word, sometimes leading to a bland plot. Beam search is like brainstorming a few possible next sentences, then picking the best one, resulting in a more coherent narrative, but it’s slower. Speculative decoding is like having a junior writer quickly draft entire paragraphs (draft model) which the senior editor (main model) then quickly checks and approves. This speeds up the whole writing process.
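To make the token-by-token process concrete, here is a minimal greedy-decoding loop over a toy stand-in model; both are illustrative rather than a real LLM API. Beam search would track several candidate sequences instead of one, and speculative decoding would add a draft model plus a parallel verification step on top of a loop like this.

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens=32, eos_id=None):
    """Greedy decoding: at every step, append the single most likely next token.

    `model` is any callable returning logits of shape (batch, seq, vocab);
    replacing the argmax with sampling or a beam gives the other strategies."""
    ids = input_ids
    for _ in range(max_new_tokens):
        logits = model(ids)                                        # (B, T, vocab)
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_token], dim=-1)
        if eos_id is not None and bool((next_token == eos_id).all()):
            break
    return ids

class TinyLM(torch.nn.Module):
    """Toy stand-in language model so the sketch runs end to end."""
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.head = torch.nn.Linear(dim, vocab)

    def forward(self, ids):
        return self.head(self.embed(ids))

print(greedy_decode(TinyLM(), torch.tensor([[1, 2, 3]]), max_new_tokens=5))
```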
The Interplay of Training and Inference
The optimizations in training and inference are deeply interconnected. For instance, the understanding that factual knowledge is primarily embedded during pre-training guides how fine-tuning datasets are designed, focusing more on instruction following and task adaptation rather than injecting entirely new facts. Similarly, the discovery that “slow thinking” during training yields more stable models with better reasoning capabilities highlights the importance of incorporating detailed Chain-of-Thought (CoT) paths, which can then influence how the model performs during complex reasoning tasks in inference. The fact that LLMs are sensitive to context when inferring causal relations means that prompt design during inference must carefully consider contextual cues.
The goal is not just to make LLMs bigger, but to make them smarter, more adaptable, and ultimately, more practical for everyone.
The Road Ahead: Towards Green AI
The continuous scaling of LLMs necessitates an ongoing pursuit of efficiency. While significant progress has been made (FLM-101B, for instance, reportedly reached about 80% of baseline performance with roughly 10% of the training FLOPs by using a growth strategy in which successively larger models inherit knowledge from smaller predecessors), the journey is far from over.
Future research directions include Mixture-of-Experts (MoE) models, in which a router activates only a few specialized “expert” sub-networks per input; early-exit mechanisms, which let simpler queries be answered by earlier, smaller layers; and second-order optimization techniques.
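As a rough sketch of the MoE idea, here is a toy top-2 routed layer in PyTorch; real MoE layers add load-balancing losses, capacity limits, and expert parallelism, and the sizes here are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy Mixture-of-Experts layer: a router sends each token to its top-k
    experts, so only a fraction of the parameters is active per token."""

    def __init__(self, dim: int = 256, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        ])
        self.k = k

    def forward(self, x):                               # x: (tokens, dim)
        scores = self.router(x)                         # (tokens, n_experts)
        weights, idx = scores.topk(self.k, dim=-1)      # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

y = TopKMoE()(torch.randn(10, 256))
```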
Remember the early days of computers, when they filled entire rooms? Now, thanks to continuous optimization and ingenious engineering, we carry powerful devices in our pockets. Similarly, the journey of LLMs is toward making these incredibly complex “brains” not just powerful, but also accessible, cost-effective, and environmentally sustainable for everyone. The future of AI hinges on this relentless pursuit of efficiency.
Thanks for Reading!
Published via Towards AI