DeepSpeed ZeRO-DP: distributed training for large models

Last Updated on June 11, 2024 by Editorial Team

Author(s): Amina Shabbeer

Originally published on Towards AI.

DeepSpeed’s ZeRO (Zero Redundancy Optimizer) is a distributed training framework with a number of optimizations that make it easy to train large deep learning models across multiple GPUs and nodes. These optimizations reduce memory redundancy and communication volume, making it practical to spread training across many devices. A key benefit is that model developers do not need to think about these systems-level optimizations: they can do distributed training with their existing PyTorch model code, a few config settings, and some boilerplate code from the DeepSpeed library (a minimal sketch follows below). If you are a model developer seeking a quick description of what’s happening under the hood, this article is for you.
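To make that concrete, here is a minimal sketch of what that boilerplate typically looks like, assuming a toy model and synthetic data. The config values are illustrative rather than tuned defaults, and the script would be launched with the DeepSpeed launcher (e.g., deepspeed train.py).

```python
import torch
import deepspeed

# A tiny stand-in model; in practice this is your existing PyTorch nn.Module.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
)

# Illustrative DeepSpeed config: the ZeRO stage and mixed precision are plain settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # 1 = P_os, 2 = P_os+g, 3 = P_os+g+p
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model in an engine that handles partitioning,
# mixed precision, and the collective communication described in this article.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for step in range(100):
    # Synthetic batch; replace with your own data loader.
    x = torch.randn(8, 512, dtype=torch.half, device=model_engine.device)
    y = torch.randint(0, 10, (8,), device=model_engine.device)
    loss = torch.nn.functional.cross_entropy(model_engine(x), y)
    model_engine.backward(loss)  # engine-managed backward pass
    model_engine.step()          # engine-managed optimizer step and synchronization
```

Choosing stage 1, 2, or 3 under zero_optimization selects how much of the model state is partitioned, which is exactly what the rest of this article unpacks.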

In this article, we focus on ZeRO-DP, data-parallel with zero redundancy sharding. We leave a description of residual memory savings in ZeRO-R for a future article.

ZeRO-DP combines the benefits of model, pipeline and data parallelism while overcoming the individual limitations of each. So let us first recall model and data parallelism.

Recap: Data and Model Parallelism

In Data Parallelism (DP), a model is replicated across multiple devices (e.g., GPUs), and each device processes a different subset of the data. After processing, gradients are aggregated and model parameters are synchronized. In Model Parallelism (MP), different parts of the model are distributed across multiple devices. Each device is responsible for computing a different portion of the model’s operations. The illustration below shows the main idea.

Figure: schematic comparing model and data parallelism.
Table: comparison of features and limitations of model parallelism vs. data parallelism.

See the table above for a summary of the features and limitations of DP vs. MP. DP has good compute/communication efficiency but poor memory efficiency (each device keeps a full copy of the model). MP has good memory efficiency but can have poor communication efficiency due to the need to synchronize across model partitions. ZeRO-DP aims to get the best of both worlds and be both memory and compute efficient:

  • It retains the memory efficiency of model parallelism by eliminating redundancy: model states are partitioned across devices instead of being replicated as in DP, so each device stores a mutually exclusive subset of the model state.
  • It retains compute/communication efficiency by using a dynamic communication strategy. The key insight is that not all states are needed on all devices at all times: when a state is needed during training, the device that owns it broadcasts it to the others, and once used it is discarded.

Where does the memory go?

If you’ve tried training a large recommender system or fine-tuning even a relatively small LLM like Llama-7B, you’ve probably seen frustrating OOM errors. So why doesn’t a 7B-parameter model, assuming fp32 weights at 4 bytes per parameter, fit on a GPU with 7B × 4 bytes = 28 GB of memory? Because beyond the parameters, each device also needs the gradients (same size as the parameters) and the optimizer states, e.g., the first and second moments of the gradients when using the Adam optimizer (twice the size of the parameters).

If we use mixed-precision training, we get away with fp16 for the parameters and gradients, i.e., 4P bytes for a model with P parameters (2P each for parameters and gradients). But we still need to maintain fp32 copies of the parameters and optimizer states for the weight update, which take another 12P bytes (4P for the fp32 parameters and 4P each for the first and second moments of the gradients). That brings the per-device total under vanilla DP to 16P bytes.
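As a quick sanity check, here is that arithmetic for a hypothetical 7B-parameter model (illustrative only; it ignores activations, buffers, and fragmentation):

```python
P = 7e9                          # parameters in a Llama-7B-sized model
fp16_params = 2 * P              # fp16 weights
fp16_grads = 2 * P               # fp16 gradients
fp32_optimizer = 12 * P          # fp32 weights + first and second moments (4P each)

total = fp16_params + fp16_grads + fp32_optimizer   # = 16P
print(f"{total / 1e9:.0f} GB")   # 112 GB per device under vanilla DP
```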

ZeRO provides three optimization stages that partition progressively more of the model states:

  • Optimizer State Partitioning: P_os
  • Gradient Partitioning: P_os+g
  • Parameter Partitioning: P_os+g+p
[From the DeepSpeed paper] A visualization of memory requirements in the three optimization stages

ZeRO-DP is still a data-parallel mode, but behind the scenes it eliminates the redundancy in model states. Consider the Stage 1 optimization (P_os) while training across N devices. ZeRO partitions the optimizer state into N partitions; each device owns the update of one partition, i.e., 1/N of the optimizer state and hence 1/N of the corresponding parameters. At the end of a training step the parameters are synchronized (all-gather), so consistently updated parameters are available on all devices. In the mixed-precision example above, the per-device memory requirement becomes 4P + 12P/N, which for large N tends to 4P, a 4x reduction over vanilla DP's 16P. Moreover, the memory saved by storing less state on each device affords larger batch sizes, making training even more efficient.

Stage 2 (P_os+g) works similarly. Since each device owns the update of 1/N of the parameters, it only needs 1/N of the gradients during the backward pass, and once a gradient partition has been used it can be released. In the mixed-precision example above, the memory requirement becomes 2P + (2P + 12P)/N, which for large N approaches 2P, an 8x reduction.

Finally, in the P_os+g+p stage: why store all parameters on every device all the time if each device only owns the update of 1/N of them? Storing only 1/N of the parameters on each device brings the per-device memory in our example down to 16P/N. This implies that as long as we have enough devices to distribute the parameters over, we could scale to an arbitrarily large model. Can this be true? What about the communication cost of synchronizing parameters at the end of each training step?
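The per-device memory for all three stages can be summarized in a few lines. This is a simplified sketch of the paper's accounting (fp16 parameters and gradients, fp32 Adam states), ignoring activations and temporary buffers:

```python
def per_device_memory_bytes(P: float, N: int, stage: int) -> float:
    """Approximate per-device memory for a P-parameter model sharded over N devices."""
    params, grads, opt = 2 * P, 2 * P, 12 * P   # fp16 params/grads, fp32 optimizer states
    if stage == 0:   # vanilla DP: everything replicated
        return params + grads + opt
    if stage == 1:   # P_os: optimizer states partitioned
        return params + grads + opt / N
    if stage == 2:   # P_os+g: optimizer states and gradients partitioned
        return params + (grads + opt) / N
    if stage == 3:   # P_os+g+p: everything partitioned
        return (params + grads + opt) / N
    raise ValueError("stage must be 0, 1, 2, or 3")

# Example: 7B parameters sharded over 64 devices.
for s in range(4):
    print(s, f"{per_device_memory_bytes(7e9, 64, s) / 1e9:.1f} GB")
# 0: 112.0 GB, 1: 29.3 GB, 2: 15.5 GB, 3: 1.8 GB
```

The 16P/N term in Stage 3 is what drives the "enough devices can hold an arbitrarily large model" argument above.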

How much communication?

If we distribute partitions across an arbitrary number of nodes, don’t we have to pay the price in communication costs to keep the partitions in sync? The key is that not all parameters are needed on all devices at all times. By limiting when they are synchronized, the communication cost stays close to what classic DP already incurs.

In classic DP, after the backward pass each device has a local gradient computed w.r.t. its local data. The local gradients from all devices are then averaged, and each device uses the averaged gradient to do a weight update. Since all devices apply the same averaged gradient, they make the same update, and every device keeps a consistent copy of the model. For efficiency, this all-reduce operation on the gradients is implemented in the two steps below and needs a total of 2P communication for a model with P parameters:

  • reduce-scatter: each process reduces (averages) the part of the gradient it is responsible for [O(P) communication for a gradient of size P].
  • all-gather: each process gathers the gradient parts reduced by all other processes [also O(P) communication for a gradient of size P].

Both steps are pipelined, so the operation stays communication-bound and GPUs are not left idle.
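For intuition, the two steps map directly onto the standard collectives exposed by torch.distributed. Below is a minimal sketch, assuming a process group is already initialized and every rank holds a gradient tensor whose length is divisible by the world size:

```python
import torch
import torch.distributed as dist

def two_step_all_reduce(grad: torch.Tensor) -> torch.Tensor:
    """Average a gradient across ranks as reduce-scatter followed by all-gather."""
    world_size = dist.get_world_size()
    shards = list(grad.chunk(world_size))       # one shard per rank
    my_shard = torch.empty_like(shards[0])

    # Step 1: reduce-scatter -- each rank ends up owning the sum of one shard.
    dist.reduce_scatter(my_shard, shards, op=dist.ReduceOp.SUM)
    my_shard /= world_size                      # turn the sum into an average

    # Step 2: all-gather -- every rank reassembles the full averaged gradient.
    gathered = [torch.empty_like(my_shard) for _ in range(world_size)]
    dist.all_gather(gathered, my_shard)
    return torch.cat(gathered)
```

In ZeRO Stage 2 the same two collectives appear, but after the reduce-scatter each rank updates only the parameter partition it owns, and the final all-gather is applied to the updated parameters rather than to the gradients.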

P_os+g: The communication needed is the same as in classic DP. A reduce-scatter costs P to reduce the part of the gradient each process owns. Each process then updates only the parameter partition it owns and communicates the updated parameters to all other devices, an all-gather that costs another P. Again, the total communication is 2P.

P_os+g+p: In this stage, only 1/N of the P model parameters are stored on each device, so each device must broadcast its P/N parameter partition to all N devices for both the forward and the backward pass, i.e., communication of P/N × N = P per pass, or 2P in total. The reduce-scatter on the gradients costs another P, as above. The total communication is therefore 3P, i.e., 1.5× that of classic DP. This communication is spread out so that parameters are only present on a device when they are needed and are discarded right after, preserving the memory savings discussed above.
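Putting those counts together, a back-of-the-envelope summary (counting only gradient and parameter traffic per training step, for the modes discussed above):

```python
def comm_volume_per_step(P: float, mode: str) -> float:
    """Approximate per-step communication volume, in units of model parameters."""
    if mode in ("classic_dp", "P_os+g"):
        return 2 * P    # gradient reduce-scatter (P) + all-gather (P)
    if mode == "P_os+g+p":
        return 3 * P    # parameter all-gathers for fwd + bwd (2P) + gradient reduce-scatter (P)
    raise ValueError(f"unknown mode: {mode}")
```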

Conclusion

DeepSpeed provides a sophisticated distributed training strategy that reduces memory redundancy compared to classic DP while incurring a similar amount of communication. The library is well designed, allowing model developers to use it without having to understand the mechanics described above. If you want to know anyway, hopefully this article gives you an overview of what’s happening under the hood. Follow for similar articles.

References:

  1. Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. https://arxiv.org/pdf/1910.02054v3
  2. DeepSpeed: https://www.deepspeed.ai/
  3. NVIDIA NCCL User Guide, Collective Operations: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html


Published via Towards AI
