
Efficient Training Engine (ETE) for Large Deep Learning Models
Last Updated on February 17, 2025 by Editorial Team
Author(s): Sarvesh Khetan
Originally published on Towards AI.
Table of Contents:
There are many ways to efficiently train a large DL model
1. Parallel / Distributed Training
- Distributed Data Parallelism (DDP)
a. DDP Algorithm Intuition
b. DDP Algorithm
c. Code Implementation
- Model Parallelism (MP)
a. Pipeline Parallelism (PP)
b. Tensor Parallelism (TP)
- Hybrid / Mixed Parallelism
Combination of Data Parallelism and Model Parallelism, i.e. you split both the data and the model across multiple GPUs
2. Quantization (Quantization Aware Training, QAT)
In the efficient inference engine article, we learnt how to convert FP32 weights into INT8 weights using quantization after the model has already been trained. But what if we could learn INT8 weights directly during training, instead of learning FP32 weights first?
The idea is to perform quantization in one layer and then dequantize before passing the result to the next layer, so that the next layer does not even know that quantization took place. These are called "fake" quants.
The goal of this trick is to introduce some quantization error into the MSE loss that is used to train the regression model. Hence, in the process of minimizing the MSE loss, this quantization error also gets minimized, and we end up learning good quantized weights (see the sketch after this list).
3. Low Rank Adaptation (LoRA)
4. Quantised Low Rank Adaptation (QLoRA)
Combining methods 2 and 3, i.e. quantization followed by LoRA
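To make the idea of "fake" quants from point 2 concrete, here is a minimal sketch of quantization-aware training for a small regression model. It is only an illustration: the symmetric INT8 scheme, the layer sizes and the names FakeQuant / QATRegressor are assumptions of this sketch, not code from the article.

import torch
import torch.nn as nn

class FakeQuant(torch.autograd.Function):
    """Quantize a tensor to INT8 and immediately dequantize it (a 'fake' quant).
    The backward pass is a straight-through estimator, so gradients flow
    as if no quantization had happened."""

    @staticmethod
    def forward(ctx, x):
        scale = x.abs().max() / 127.0 + 1e-8                      # symmetric per-tensor scale
        x_int8 = torch.clamp(torch.round(x / scale), -128, 127)   # quantize to the INT8 range
        return x_int8 * scale                                     # dequantize back to FP32

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                                        # straight-through estimator

class QATRegressor(nn.Module):
    """Tiny regression model with a fake quant between its two layers:
    the second layer only ever sees dequantized values."""
    def __init__(self, in_dim=16, hidden_dim=32):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = FakeQuant.apply(h)        # rounding error introduced here flows into the loss
        return self.fc2(h)

model = QATRegressor()
loss_fn = nn.MSELoss()
x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = loss_fn(model(x), y)           # MSE loss now also contains the quantization error
loss.backward()

Because the second layer only ever sees dequantized values, the rounding error shows up directly in the MSE loss, which is exactly the error the optimizer then learns to minimize.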
Distributed Data Parallelism (DDP)
Here you split the data across several GPUs while the same model is loaded on all of them. This wonderful video explains data parallelism with animations (watch till timestamp 12:52).
Distributed Data Parallel (DDP) Algorithm Intuition
In the above video you saw the speaker talk about gradient synchronization. How is it done? To understand that, we first need to understand gradient accumulation.
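As a quick refresher, here is a minimal sketch of gradient accumulation on a single device; the toy model, data and the choice of 4 accumulation steps are purely illustrative.

import torch
import torch.nn as nn

model = nn.Linear(16, 2)                                  # toy model and data, purely for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
data_loader = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(8)]

accumulation_steps = 4    # number of mini-batches whose gradients are summed before one update

optimizer.zero_grad()
for step, (data, labels) in enumerate(data_loader):
    loss = loss_fn(model(data), labels) / accumulation_steps  # scale so the summed gradients average out
    loss.backward()                                           # gradients keep accumulating in param.grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                      # one update using the accumulated gradients
        optimizer.zero_grad()                                 # reset for the next accumulation window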
Now we will use this gradient accumulation concept to synchronize gradients in Distributed Data Parallel!!
This step, in a setup with more than 2 GPUs, can be generalized as follows:
Distributed Data Parallel (DDP) Algorithm
- At the beginning of the training, the model's weights are randomly initialized on one GPU
- These weights are now sent to all other GPUs (Broadcast Operation)
- Each GPU trains the same model on a subset of dataset and calculates gradients
- Now the gradients from each GPU are accumulated / summed up on one GPU (Reduce Operation)
- These accumulated gradients are used to calculate the new parameters of the model, which are then sent back to all other GPUs, and the process continues!! (Repeat again from step 2)
This continuous process of Broadcast and Reduce is called "ALL REDUCE"
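For intuition, here is a minimal sketch of that gradient synchronization expressed with torch.distributed collective operations. The function name and the explicit averaging are illustrative; DistributedDataParallel, used in the code below, performs this synchronization for you during loss.backward().

import torch.distributed as dist

def synchronize_gradients(model):
    """Illustrative manual gradient synchronization, to be called after
    loss.backward() in a process group initialized with init_process_group()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # All-Reduce = Reduce (sum gradients from every GPU) + Broadcast (share the result back)
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size    # every GPU now holds the same averaged gradient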
Code Implementation
'''
Step 0 : Load necessary libraries for Distributed Training
'''
import os
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel
from torch.distributed import init_process_group, destroy_process_group
'''
Step 1 : Load Local Rank and Global Rank of GPUs
'''
local_rank = int(os.environ['LOCAL_RANK'])
global_rank = int(os.environ['RANK'])
'''
Step 2 : Tell pytorch that you are performing distributed training on CUDA GPUs
'''
init_process_group(backend='nccl') #NCCL is a communication framework for CUDA GPUs
torch.cuda.set_device(local_rank) # Set the device to local rank
'''
Step 3 : Running Distributed Training Loop
'''
# setting up a distributed dataloader so that each GPU gets a different shard of the data
data_loader = DataLoader(train_dataset,
                         shuffle=False,  # has to be kept False because the DistributedSampler shuffles instead
                         sampler=DistributedSampler(train_dataset, shuffle=True))

# sending the model to this GPU and wrapping it for gradient synchronization
model = MyModel().to(local_rank)
model = DistributedDataParallel(model, device_ids=[local_rank])

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-9)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for data, labels in data_loader:
        data, labels = data.to(local_rank), labels.to(local_rank)
        loss = loss_fn(model(data), labels)  # Forward step
        loss.backward()                      # Backward step + gradient synchronization
        optimizer.step()                     # Update weights
        optimizer.zero_grad()                # Reset gradients to zero

    if global_rank == 0:  # Only save the checkpoint from GPU 0
        torch.save(model.module.state_dict(), 'latest_checkpoint.pth')
        # Also save the optimizer state and other variables needed
        # to restore the training state
'''
Step 4 : Destroy all the GPU resource created
'''
destroy_process_group()
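Note that a script like this is typically launched with torchrun (for example, torchrun --nproc_per_node=4 your_training_script.py, where the script name is a placeholder), which spawns one process per GPU and sets the LOCAL_RANK and RANK environment variables that are read in Step 1.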
Model Parallelism (MP)
Here you split the model across GPUs instead of splitting the data. You can divide the model in two ways:
1. Pipeline Parallelism (PP)
Here you cut the model VERTICALLY and distribute it across different GPUs, e.g. each GPU is assigned 2 layers of the model (as shown below).
A big issue with this method is GPU utilization. When GPU 4 is performing the forward pass on the outputs of GPU 3, the other GPUs, i.e. GPU 1, GPU 2 and GPU 3, are sitting idle. Now imagine using thousands of GPUs in a big architecture, and you can imagine the number of idle GPUs. Hence, GPU utilization is not great with this method.
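As a minimal sketch of this vertical split (assuming 2 GPUs; the layer sizes and class name are illustrative), a naive pipeline-parallel model might look like this:

import torch
import torch.nn as nn

class PipelineParallelModel(nn.Module):
    """Naive pipeline parallelism: the first stage of layers lives on cuda:0
    and the second stage on cuda:1."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to('cuda:0')
        self.stage2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                                    nn.Linear(512, 10)).to('cuda:1')

    def forward(self, x):
        x = self.stage1(x.to('cuda:0'))   # GPU 0 computes while GPU 1 sits idle
        x = self.stage2(x.to('cuda:1'))   # GPU 1 computes while GPU 0 sits idle
        return x

# out = PipelineParallelModel()(torch.randn(8, 512))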
2. Tensor Parallelism (TP)
Here you cut the model HORIZONTALLY and distribute it across different GPUs, i.e. each layer's weights are split across the GPUs.
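As a minimal sketch (again assuming 2 GPUs; the class name and the column-wise split of the weight matrix are illustrative of one common way to do this), a tensor-parallel linear layer might look like this:

import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """A linear layer whose weight matrix is split column-wise across two GPUs:
    each GPU computes its slice of the output for the full input, and the
    slices are then gathered back together."""
    def __init__(self, in_features, out_features):
        super().__init__()
        half = out_features // 2
        self.shard0 = nn.Linear(in_features, half).to('cuda:0')
        self.shard1 = nn.Linear(in_features, out_features - half).to('cuda:1')

    def forward(self, x):
        out0 = self.shard0(x.to('cuda:0'))    # both GPUs work on the same input at the same time
        out1 = self.shard1(x.to('cuda:1'))
        return torch.cat([out0, out1.to('cuda:0')], dim=-1)   # gather the output slices on GPU 0

# out = ColumnParallelLinear(512, 256)(torch.randn(8, 512))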