Efficient Training Engine (ETE) for Large Deep Learning Models

Last Updated on February 17, 2025 by Editorial Team

Author(s): Sarvesh Khetan

Originally published on Towards AI.

Table of Contents:

There are many ways to efficiently train a large DL model:

1. Parallel / Distributed Training

2. Quantization (Quantization-Aware Training, QAT)

In the efficient inference engine article, we learnt how to convert FP32 weights into INT8 weights using quantization after the model has been trained. But what if we could directly learn INT8 weights during training itself, instead of learning FP32 weights first?

The idea is to perform quantization in one layer and, before passing the result to the next layer, perform dequantization, so that the next layer does not even know that quantization took place. These quantize-dequantize operations are called "fake" quants.

The goal of this trick is to introduce some quantization error into the loss used to train the model (e.g. the MSE loss of a regression model). Hence, in the process of minimizing the loss, this quantization error also gets minimized, and we end up learning weights that remain accurate after quantization!
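
Below is a minimal sketch of what such a fake quant could look like in PyTorch. `fake_quantize` and `FakeQuantLinear` are illustrative names (not from any specific library), the sketch assumes simple symmetric per-tensor INT8 quantization, and it uses a straight-through estimator so gradients can flow through the rounding step; production QAT tooling (e.g. torch.ao.quantization) handles scales and observers far more carefully.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(x, num_bits=8):
    # symmetric per-tensor quantization to the INT8 range, immediately dequantized
    qmax = 2 ** (num_bits - 1) - 1                 # 127 for 8 bits
    scale = x.abs().max().clamp(min=1e-8) / qmax   # per-tensor scale factor
    x_dq = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # straight-through estimator: the forward pass sees the quantized values,
    # but gradients flow back as if no rounding had happened
    return x + (x_dq - x).detach()

class FakeQuantLinear(nn.Module):
    # hypothetical QAT layer: weights stay in FP32, but every forward pass
    # routes them through a fake quant, so the quantization error shows up
    # in the training loss and gets minimized along with it
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)

    def forward(self, x):
        w_q = fake_quantize(self.linear.weight)    # "quantize" the weights
        out = F.linear(x, w_q, self.linear.bias)   # the next layer sees FP32 values again
        return fake_quantize(out)                  # fake-quantize the activations too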

3. Low Rank Adaptation (LoRA)

4. Quantised Low Rank Adaptation (QLoRA)

Combining methods 2 and 3, i.e. quantization followed by LoRA.
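
As a rough illustration of this combination, here is a hedged sketch of a linear layer whose pretrained weight is frozen in quantized form while only small low-rank matrices are trained. `LoRAQuantLinear` is a hypothetical name; real QLoRA uses 4-bit NF4 quantization (via bitsandbytes) rather than the simple INT8 scheme used here to keep the example self-contained.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRAQuantLinear(nn.Module):
    # hypothetical sketch: a frozen INT8 copy of the pretrained weight plus a
    # trainable low-rank update B @ A, which is the core idea behind QLoRA
    def __init__(self, base_linear: nn.Linear, rank=8, alpha=16):
        super().__init__()
        w = base_linear.weight.detach()
        qmax = 127
        scale = w.abs().max().clamp(min=1e-8) / qmax
        self.register_buffer('scale', scale)  # dequantization scale
        self.register_buffer('w_int8', torch.clamp(torch.round(w / scale),
                                                   -qmax, qmax).to(torch.int8))
        self.bias = base_linear.bias
        if self.bias is not None:
            self.bias.requires_grad_(False)   # only the LoRA matrices are trained
        # trainable low-rank adapters: W_eff = W_frozen + (alpha / rank) * B @ A
        out_f, in_f = base_linear.weight.shape
        self.A = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_f, rank))  # zero init => no change at start
        self.scaling = alpha / rank

    def forward(self, x):
        w = self.w_int8.float() * self.scale             # dequantize the frozen weight
        w = w + self.scaling * (self.B @ self.A)          # add the low-rank update
        return F.linear(x, w, self.bias)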

Distributed Data Parallelism (DDP)

Here you split the data across several GPUs while the same model is loaded on all the GPUs. This wonderful video explains data parallelism with animations (watch until timestamp 12:52).

Distributed Data Parallel (DDP) Algorithm Intuition

In the above video you saw the speaker talk about gradient synchronization. How is it done? To understand that, we first need to understand gradient accumulation.
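
As a quick refresher, gradient accumulation simply means calling loss.backward() on several mini-batches before a single optimizer.step(), so their gradients get summed into the .grad buffers. A minimal sketch (assuming model, data_loader, loss_fn and optimizer are already defined, as in the code later in this article):

accumulation_steps = 4  # effective batch size = batch_size * accumulation_steps

for step, (data, labels) in enumerate(data_loader):
    loss = loss_fn(model(data), labels)
    (loss / accumulation_steps).backward()  # gradients are ADDED into .grad, not overwritten

    if (step + 1) % accumulation_steps == 0:
        optimizer.step()       # update weights using the accumulated (summed) gradients
        optimizer.zero_grad()  # reset the gradient buffers for the next accumulation window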

Now we will use this gradient accumulation concept to synchronize gradients in Distributed Data Parallel!

In a setup with more than 2 GPUs, this step can be generalized as follows:

Distributed Data Parallel (DDP) Algorithm

  • At the beginning of the training, the model’s weights are randomly initialized on one GPU
  • These weights are now sent to all other GPUs (Broadcast Operation)
  • Each GPU trains the same model on a subset of the dataset and calculates gradients
  • Now the gradients from each GPU are accumulated / summed up on one GPU (Reduce Operation)
  • These accumulated gradients are used to calculate the new parameters of the model, which are then sent back to all the other GPUs, and the process repeats from step 2

This continuous process of Broadcast and Reduce is called "ALL REDUCE".
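
Conceptually, the gradient synchronization that DDP performs for you after every backward pass can be written with torch.distributed primitives. The sketch below is a simplification of what happens under the hood (in practice DDP overlaps this communication with the backward pass and averages the gradients, hence the division by the world size):

import torch.distributed as dist

def synchronize_gradients(model):
    # simplified sketch of DDP's gradient synchronization:
    # after this, every GPU holds the same (averaged) gradients
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)  # sum gradients across all GPUs
            param.grad /= world_size                           # average them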

Code Implementation

'''
Step 0 : Load necessary libraries for Distributed Training
'''

import os
import torch
from torch.utils.data import DataLoader

from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel
from torch.distributed import init_process_group, destroy_process_group

'''
Step 1 : Load Local Rank and Global Rank of GPUs
'''

local_rank = int(os.environ['LOCAL_RANK'])
global_rank = int(os.environ['RANK'])

'''
Step 2 : Tell pytorch that you are performing distributed training on CUDA GPUs
'''

init_process_group(backend='nccl') #NCCL is a communication framework for CUDA GPUs
torch.cuda.set_device(local_rank) # Set the device to local rank

'''
Step 3 : Running Distributed Training Loop
'''

# setting up a distributed dataloader so that each GPU receives its own shard of the data
data_loader = DataLoader(train_dataset,
                         shuffle=False,  # must be False here; shuffling is done by the DistributedSampler
                         sampler=DistributedSampler(train_dataset, shuffle=True))

# sending the model to this process's GPU and wrapping it with DDP
model = DistributedDataParallel(MyModel().to(local_rank), device_ids=[local_rank])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-9)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    data_loader.sampler.set_epoch(epoch)  # so each epoch uses a different shuffle
    for data, labels in data_loader:
        data, labels = data.to(local_rank), labels.to(local_rank)
        loss = loss_fn(model(data), labels)  # Forward step
        loss.backward()  # Backward step + gradient synchronization
        optimizer.step()  # Update weights
        optimizer.zero_grad()  # Reset gradients to zero

    if global_rank == 0:  # Only save the checkpoint from GPU 0
        torch.save(model.module.state_dict(), 'latest_checkpoint.pth')
        # Also save the optimizer state and other variables needed
        # to restore the training state

'''
Step 4 : Destroy all the GPU resource created
'''

destroy_process_group()
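
Note that a DDP script like this is not launched with plain python; it is typically started with a launcher such as torchrun (for example, torchrun --nproc_per_node=4 train.py, where the script name is just a placeholder), which spawns one process per GPU and sets the LOCAL_RANK and RANK environment variables that Step 1 reads.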

Model Parallelism (MP)

Here you split the model across GPUs instead of splitting the data. You can divide the model in two ways:

1. Pipeline Parallelism (PP)

Here you cut the model VERTICALLY and distribute it across different GPUs, e.g. each GPU is assigned 2 layers of the model (as shown below).

A big issue with this method is GPU utilization. When GPU 4 is performing the forward pass on the outputs of GPU 3, the other GPUs, i.e. GPU 1, GPU 2 and GPU 3, are sitting idle. Now imagine using thousands of GPUs in a big architecture, and you can imagine the number of idle GPUs. Hence, GPU utilization is not great with this method.
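
A naive sketch of this vertical split in plain PyTorch is shown below. The layer sizes and the two-GPU split are made up for illustration; real pipeline-parallel frameworks additionally slice each batch into micro-batches to reduce the idle time described above.

import torch
import torch.nn as nn

class PipelineParallelModel(nn.Module):
    # hypothetical two-stage split: the first layers live on cuda:0,
    # the remaining layers on cuda:1, and activations are shipped between them
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                                    nn.Linear(1024, 1024), nn.ReLU()).to('cuda:0')
        self.stage2 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                                    nn.Linear(1024, 10)).to('cuda:1')

    def forward(self, x):
        x = self.stage1(x.to('cuda:0'))  # while this runs, cuda:1 is idle
        x = self.stage2(x.to('cuda:1'))  # and now cuda:0 is idle
        return x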

2. Tensor Parallelism (TP)

Here you cut the model HORIZONTALLY and distribute it to different GPUs
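
A minimal, single-process sketch of the idea for one linear layer is shown below (a column-wise split of the weight matrix; a real tensor-parallel implementation such as Megatron-LM keeps each shard on a different GPU and gathers the partial outputs with communication collectives):

import torch

torch.manual_seed(0)
x = torch.randn(4, 8)    # a batch of activations
W = torch.randn(8, 16)   # the full weight matrix of one linear layer

# "horizontal" split of the layer: each GPU would hold only a slice of W
W_shard0, W_shard1 = W.chunk(2, dim=1)  # columns 0-7 on GPU 0, columns 8-15 on GPU 1

# each GPU computes a partial output using only its own shard...
out0 = x @ W_shard0
out1 = x @ W_shard1

# ...and the partial outputs are gathered/concatenated to form the full output
out = torch.cat([out0, out1], dim=1)
assert torch.allclose(out, x @ W)  # identical to using the full weight matrix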


Published via Towards AI
