
Efficient Training Engine (ETE) for Large Deep Learning Models
Last Updated on February 17, 2025 by Editorial Team
Author(s): Sarvesh Khetan
Originally published on Towards AI.
Table of Contents:
There are many ways to efficiently train a large DL model
1. Parallel / Distributed Training
- Distributed Data Parallelism (DDP)
a. DDP Algorithm Intuition
b. DDP Algorithm
c. Code Implementation
- Model Parallelism (MP)
a. Pipeline Parallelism (PP)
b. Tensor Parallelism (TP)
- Hybrid / Mixed Parallelism
Combination of Data Parallelism and Model Parallelism, i.e. you split both the data and the model across multiple GPUs
2. Quantization (Quantization Aware Training, QAT)
In the efficient inference engine article, we learnt how to convert FP32 weights into INT8 weights using quantization after the model has already been trained. But what if we could learn INT8 weights directly during training, instead of learning FP32 weights first?
The idea is to perform quantization in one layer and then dequantize before passing the result to the next layer, so that the next layer does not even know that quantization took place. These are called "fake" quants.
The goal of this trick is to introduce some quantization error into the MSE loss that is used to train the regression model. Hence, in the process of minimizing the MSE loss, this quantization error also gets minimized, and we end up learning good quantized weights (see the sketch after this list).
3. Low Rank Adaptation (LoRA)
4. Quantised Low Rank Adaptation (QLoRA)
Combining methods 2 and 3, i.e. quantization followed by LoRA
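To make the idea of "fake" quants from point 2 concrete, here is a minimal sketch of quantization-aware training for a small regression model. It is only an illustration: the symmetric INT8 scheme, the layer sizes and the names FakeQuant / QATRegressor are assumptions of this sketch, not code from the article.

import torch
import torch.nn as nn

class FakeQuant(torch.autograd.Function):
    """Quantize a tensor to INT8 and immediately dequantize it (a 'fake' quant).
    The backward pass is a straight-through estimator, so gradients flow
    as if no quantization had happened."""

    @staticmethod
    def forward(ctx, x):
        scale = x.abs().max() / 127.0 + 1e-8                      # symmetric per-tensor scale
        x_int8 = torch.clamp(torch.round(x / scale), -128, 127)   # quantize to the INT8 range
        return x_int8 * scale                                     # dequantize back to FP32

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output                                        # straight-through estimator

class QATRegressor(nn.Module):
    """Tiny regression model with a fake quant between its two layers:
    the second layer only ever sees dequantized values."""
    def __init__(self, in_dim=16, hidden_dim=32):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = FakeQuant.apply(h)        # rounding error introduced here flows into the loss
        return self.fc2(h)

model = QATRegressor()
loss_fn = nn.MSELoss()
x, y = torch.randn(8, 16), torch.randn(8, 1)
loss = loss_fn(model(x), y)           # MSE loss now also contains the quantization error
loss.backward()

Because the second layer only ever sees dequantized values, the rounding error shows up directly in the MSE loss, which is exactly the error the optimizer then learns to minimize.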
Distributed Data Parallelism (DDP)
Here you split the data across several GPUs while the same model is loaded on all of them. This wonderful video explains data parallelism with animations (watch till timestamp 12:52).
Distributed Data Parallel (DDP) Algorithm Intuition
In the above video you saw the speaker talk about gradient synchronization. How is it done? To understand that, we first need to understand gradient accumulation.
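As a quick refresher, here is a minimal sketch of gradient accumulation on a single device; the toy model, data and the choice of 4 accumulation steps are purely illustrative.

import torch
import torch.nn as nn

model = nn.Linear(16, 2)                                  # toy model and data, purely for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
data_loader = [(torch.randn(4, 16), torch.randint(0, 2, (4,))) for _ in range(8)]

accumulation_steps = 4    # number of mini-batches whose gradients are summed before one update

optimizer.zero_grad()
for step, (data, labels) in enumerate(data_loader):
    loss = loss_fn(model(data), labels) / accumulation_steps  # scale so the summed gradients average out
    loss.backward()                                           # gradients keep accumulating in param.grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                                      # one update using the accumulated gradients
        optimizer.zero_grad()                                 # reset for the next accumulation window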
Now we will use this gradient accumulation concept to synchronize gradients in Distributed Data Parallel!!
This step, in a setup with more than 2 GPUs, can be generalized as follows:
Distributed Data Parallel (DDP) Algorithm
- At the beginning of the training, the model's weights are randomly initialized on one GPU
- These weights are now sent to all other GPUs (Broadcast Operation)
- Each GPU trains the same model on a subset of dataset and calculates gradients
- Now the gradients from each GPU are accumulated / summed up on one GPU (Reduce Operation)
- These accumulated gradients are used to calculate the new parameters of the model, which are then sent back to all other GPUs, and the process continues!! (Repeat again from step 2)
This continuous process of Broadcast and Reduce is called "ALL REDUCE"
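For intuition, here is a minimal sketch of that gradient synchronization expressed with torch.distributed collective operations. The function name and the explicit averaging are illustrative; DistributedDataParallel, used in the code below, performs this synchronization for you during loss.backward().

import torch.distributed as dist

def synchronize_gradients(model):
    """Illustrative manual gradient synchronization, to be called after
    loss.backward() in a process group initialized with init_process_group()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # All-Reduce = Reduce (sum gradients from every GPU) + Broadcast (share the result back)
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size    # every GPU now holds the same averaged gradient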
Code Implementation
'''
Step 0 : Load necessary libraries for Distributed Training
'''
import os
import torch
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel
from torch.distributed import init_process_group, destroy_process_group
'''
Step 1 : Load Local Rank and Global Rank of GPUs
'''
local_rank = int(os.environ['LOCAL_RANK'])
global_rank = int(os.environ['RANK'])
'''
Step 2 : Tell pytorch that you are performing distributed training on CUDA GPUs
'''
init_process_group(backend='nccl') #NCCL is a communication framework for CUDA GPUs
torch.cuda.set_device(local_rank) # Set the device to local rank
'''
Step 3 : Running Distributed Training Loop
'''
# setting up a distributed dataloader so that each GPU gets a different shard of the data
data_loader = DataLoader(train_dataset,
                         shuffle=False,  # has to be kept False because the DistributedSampler shuffles instead
                         sampler=DistributedSampler(train_dataset, shuffle=True))

# sending the model to this GPU and wrapping it for gradient synchronization
model = MyModel().to(local_rank)
model = DistributedDataParallel(model, device_ids=[local_rank])

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, eps=1e-9)
loss_fn = torch.nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for data, labels in data_loader:
        data, labels = data.to(local_rank), labels.to(local_rank)
        loss = loss_fn(model(data), labels)  # Forward step
        loss.backward()                      # Backward step + gradient synchronization
        optimizer.step()                     # Update weights
        optimizer.zero_grad()                # Reset gradients to zero

    if global_rank == 0:  # Only save the checkpoint from GPU 0
        torch.save(model.module.state_dict(), 'latest_checkpoint.pth')
        # Also save the optimizer state and other variables needed
        # to restore the training state
'''
Step 4 : Destroy all the GPU resource created
'''
destroy_process_group()
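Note that a script like this is typically launched with torchrun (for example, torchrun --nproc_per_node=4 your_training_script.py, where the script name is a placeholder), which spawns one process per GPU and sets the LOCAL_RANK and RANK environment variables that are read in Step 1.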
Model Parallelism (MP)
Here you split the model across GPUs instead of splitting the data. You can divide the model in two ways:
1. Pipeline Parallelism (PP)
Here you cut the model VERTICALLY and distribute it across different GPUs, e.g. each GPU is assigned 2 layers of the model (as shown below).
A big issue with this method is GPU utilization. When GPU 4 is performing the forward pass on the outputs of GPU 3, the other GPUs, i.e. GPU 1, GPU 2 and GPU 3, are sitting idle. Now imagine using thousands of GPUs in a big architecture, and you can imagine the number of idle GPUs. Hence, GPU utilization is not great with this method.
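As a minimal sketch of this vertical split (assuming 2 GPUs; the layer sizes and class name are illustrative), a naive pipeline-parallel model might look like this:

import torch
import torch.nn as nn

class PipelineParallelModel(nn.Module):
    """Naive pipeline parallelism: the first stage of layers lives on cuda:0
    and the second stage on cuda:1."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Linear(512, 512), nn.ReLU()).to('cuda:0')
        self.stage2 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(),
                                    nn.Linear(512, 10)).to('cuda:1')

    def forward(self, x):
        x = self.stage1(x.to('cuda:0'))   # GPU 0 computes while GPU 1 sits idle
        x = self.stage2(x.to('cuda:1'))   # GPU 1 computes while GPU 0 sits idle
        return x

# out = PipelineParallelModel()(torch.randn(8, 512))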
2. Tensor Parallelism (TP)
Here you cut the model HORIZONTALLY and distribute it across different GPUs, i.e. each layer's weights are split across the GPUs.
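As a minimal sketch (again assuming 2 GPUs; the class name and the column-wise split of the weight matrix are illustrative of one common way to do this), a tensor-parallel linear layer might look like this:

import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """A linear layer whose weight matrix is split column-wise across two GPUs:
    each GPU computes its slice of the output for the full input, and the
    slices are then gathered back together."""
    def __init__(self, in_features, out_features):
        super().__init__()
        half = out_features // 2
        self.shard0 = nn.Linear(in_features, half).to('cuda:0')
        self.shard1 = nn.Linear(in_features, out_features - half).to('cuda:1')

    def forward(self, x):
        out0 = self.shard0(x.to('cuda:0'))    # both GPUs work on the same input at the same time
        out1 = self.shard1(x.to('cuda:1'))
        return torch.cat([out0, out1.to('cuda:0')], dim=-1)   # gather the output slices on GPU 0

# out = ColumnParallelLinear(512, 256)(torch.randn(8, 512))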