DeepSpeed ZeRO-DP: distributed training for large models

Last Updated on June 11, 2024 by Editorial Team

Author(s): Amina Shabbeer

Originally published on Towards AI.

DeepSpeed’s ZeRO (Zero Redundancy Optimizer) is a distributed training framework with a number of optimizations that make it easy to train large deep learning models across multiple GPUs and nodes. These optimizations reduce memory redundancy and communication volume, making it practical to spread training across many devices. A key benefit is that model developers do not need to think about these systems-level optimizations: they can do distributed training with their existing PyTorch model code, a few config settings, and some boilerplate code from the DeepSpeed library (a minimal sketch follows below). If you are a model developer seeking a quick description of what’s happening under the hood, this article is for you.
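To make that concrete, here is a minimal sketch of what that boilerplate typically looks like, assuming a toy model and synthetic data. The config values are illustrative rather than tuned defaults, and the script would be launched with the DeepSpeed launcher (e.g., deepspeed train.py).

```python
import torch
import deepspeed

# A tiny stand-in model; in practice this is your existing PyTorch nn.Module.
model = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
)

# Illustrative DeepSpeed config: the ZeRO stage and mixed precision are plain settings.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # 1 = P_os, 2 = P_os+g, 3 = P_os+g+p
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize wraps the model in an engine that handles partitioning,
# mixed precision, and the collective communication described in this article.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for step in range(100):
    # Synthetic batch; replace with your own data loader.
    x = torch.randn(8, 512, dtype=torch.half, device=model_engine.device)
    y = torch.randint(0, 10, (8,), device=model_engine.device)
    loss = torch.nn.functional.cross_entropy(model_engine(x), y)
    model_engine.backward(loss)  # engine-managed backward pass
    model_engine.step()          # engine-managed optimizer step and synchronization
```

Choosing stage 1, 2, or 3 under zero_optimization selects how much of the model state is partitioned, which is exactly what the rest of this article unpacks.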

In this article, we focus on ZeRO-DP, data-parallel with zero redundancy sharding. We leave a description of residual memory savings in ZeRO-R for a future article.

ZeRO-DP combines the benefits of model, pipeline and data parallelism while overcoming the individual limitations of each. So let us first recall model and data parallelism.

Recap: Data and Model Parallelism

In Data Parallelism (DP), a model is replicated across multiple devices (e.g., GPUs), and each device processes a different subset of the data. After processing, gradients are aggregated and model parameters are synchronized. In Model Parallelism (MP), different parts of the model are distributed across multiple devices. Each device is responsible for computing a different portion of the model’s operations. The illustration below shows the main idea.

Figure: schematic comparing model and data parallelism.
Table: comparison of features and limitations of model parallelism vs. data parallelism.

See the table above for a summary of the features and limitations of DP vs. MP. DP has good compute/communication efficiency but poor memory efficiency (each device keeps a full copy of the model). MP has good memory efficiency but can have poor communication efficiency due to the need to synchronize across model partitions. ZeRO-DP aims to get the best of both worlds and be both memory and compute efficient:

  • It retains the memory efficiency of model parallelism by eliminating redundancy: model states are partitioned across devices instead of being replicated as in DP, so each device stores a mutually exclusive subset of the model state.
  • It retains compute/communication efficiency by using a dynamic communication strategy. The key insight is that not all states are needed on all devices at all times: when a state is needed during training, the device that owns it broadcasts it to the others, and once used it is discarded.

Where does the memory go?

If you’ve tried training a large recommender system or fine-tuning even a relatively small LLM like Llama-7B, you’ve probably seen frustrating OOM errors. So why doesn’t a 7B-parameter model, assuming fp32 weights at 4 bytes per parameter, fit on a GPU with 7B × 4 bytes = 28 GB of memory? Because beyond the parameters, each device also needs the gradients (same size as the parameters) and the optimizer states, e.g., the first and second moments of the gradients when using the Adam optimizer (twice the size of the parameters).

If we use mixed-precision training, we get away with fp16 for the parameters and gradients, i.e., 4P bytes for a model with P parameters (2P each for parameters and gradients). But we still need to maintain fp32 copies of the parameters and optimizer states for the weight update, which take another 12P bytes (4P for the fp32 parameters and 4P each for the first and second moments of the gradients). That brings the per-device total under vanilla DP to 16P bytes.
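As a quick sanity check, here is that arithmetic for a hypothetical 7B-parameter model (illustrative only; it ignores activations, buffers, and fragmentation):

```python
P = 7e9                          # parameters in a Llama-7B-sized model
fp16_params = 2 * P              # fp16 weights
fp16_grads = 2 * P               # fp16 gradients
fp32_optimizer = 12 * P          # fp32 weights + first and second moments (4P each)

total = fp16_params + fp16_grads + fp32_optimizer   # = 16P
print(f"{total / 1e9:.0f} GB")   # 112 GB per device under vanilla DP
```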

ZeRO provides three optimization stages that partition progressively more of the model states:

  • Optimizer State Partitioning: P_os
  • Gradient Partitioning: P_os+g
  • Parameter Partitioning: P_os+g+p
[From the DeepSpeed paper] A visualization of memory requirements in the three optimization stages

ZeRO-DP is still a data-parallel mode, but behind the scenes it eliminates the redundancy in model states. Consider the Stage 1 optimization (P_os) while training across N devices. ZeRO partitions the optimizer state into N partitions; each device owns the update of one partition, i.e., 1/N of the optimizer state and hence 1/N of the corresponding parameters. At the end of a training step the parameters are synchronized (all-gather), so consistently updated parameters are available on all devices. In the mixed-precision example above, the per-device memory requirement becomes 4P + 12P/N, which for large N tends to 4P, a 4x reduction over vanilla DP's 16P. Moreover, the memory saved by storing less state on each device affords larger batch sizes, making training even more efficient.

Stage 2 (P_os+g) works similarly. Since each device owns the update of 1/N of the parameters, it only needs 1/N of the gradients during the backward pass, and once a gradient partition has been used it can be released. In the mixed-precision example above, the memory requirement becomes 2P + (2P + 12P)/N, which for large N approaches 2P, an 8x reduction.

Finally, in the P_os+g+p stage: why store all parameters on every device all the time if each device only owns the update of 1/N of them? Storing only 1/N of the parameters on each device brings the per-device memory in our example down to 16P/N. This implies that as long as we have enough devices to distribute the parameters over, we could scale to an arbitrarily large model. Can this be true? What about the communication cost of synchronizing parameters at the end of each training step?
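The per-device memory for all three stages can be summarized in a few lines. This is a simplified sketch of the paper's accounting (fp16 parameters and gradients, fp32 Adam states), ignoring activations and temporary buffers:

```python
def per_device_memory_bytes(P: float, N: int, stage: int) -> float:
    """Approximate per-device memory for a P-parameter model sharded over N devices."""
    params, grads, opt = 2 * P, 2 * P, 12 * P   # fp16 params/grads, fp32 optimizer states
    if stage == 0:   # vanilla DP: everything replicated
        return params + grads + opt
    if stage == 1:   # P_os: optimizer states partitioned
        return params + grads + opt / N
    if stage == 2:   # P_os+g: optimizer states and gradients partitioned
        return params + (grads + opt) / N
    if stage == 3:   # P_os+g+p: everything partitioned
        return (params + grads + opt) / N
    raise ValueError("stage must be 0, 1, 2, or 3")

# Example: 7B parameters sharded over 64 devices.
for s in range(4):
    print(s, f"{per_device_memory_bytes(7e9, 64, s) / 1e9:.1f} GB")
# 0: 112.0 GB, 1: 29.3 GB, 2: 15.5 GB, 3: 1.8 GB
```

The 16P/N term in Stage 3 is what drives the "enough devices can hold an arbitrarily large model" argument above.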

How much communication?

If we distribute partitions across an arbitrary number of nodes, don’t we have to pay the price in communication costs to keep the partitions in sync? The key is that not all parameters are needed on all devices at all times. By limiting when they are synchronized, the communication cost stays close to what classic DP already incurs.

In classic DP, after the backward pass each device has a local gradient computed w.r.t. its local data. The local gradients from all devices are then averaged, and each device uses the averaged gradient to do a weight update. Since all devices apply the same averaged gradient, they make the same update, and every device keeps a consistent copy of the model. For efficiency, this all-reduce operation on the gradients is implemented in the two steps below and needs a total of 2P communication for a model with P parameters:

  • reduce-scatter: each process reduces (averages) the part of the gradient it is responsible for [O(P) communication for a gradient of size P].
  • all-gather: each process gathers the gradient parts reduced by all other processes [also O(P) communication for a gradient of size P].

Both steps are pipelined, so the operation stays communication-bound and GPUs are not left idle.
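For intuition, the two steps map directly onto the standard collectives exposed by torch.distributed. Below is a minimal sketch, assuming a process group is already initialized and every rank holds a gradient tensor whose length is divisible by the world size:

```python
import torch
import torch.distributed as dist

def two_step_all_reduce(grad: torch.Tensor) -> torch.Tensor:
    """Average a gradient across ranks as reduce-scatter followed by all-gather."""
    world_size = dist.get_world_size()
    shards = list(grad.chunk(world_size))       # one shard per rank
    my_shard = torch.empty_like(shards[0])

    # Step 1: reduce-scatter -- each rank ends up owning the sum of one shard.
    dist.reduce_scatter(my_shard, shards, op=dist.ReduceOp.SUM)
    my_shard /= world_size                      # turn the sum into an average

    # Step 2: all-gather -- every rank reassembles the full averaged gradient.
    gathered = [torch.empty_like(my_shard) for _ in range(world_size)]
    dist.all_gather(gathered, my_shard)
    return torch.cat(gathered)
```

In ZeRO Stage 2 the same two collectives appear, but after the reduce-scatter each rank updates only the parameter partition it owns, and the final all-gather is applied to the updated parameters rather than to the gradients.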

P_os+g: The communication needed is the same as in classic DP. A reduce-scatter costs P to reduce the part of the gradient each process owns. Each process then updates only the parameter partition it owns and communicates the updated parameters to all other devices, an all-gather that costs another P. Again, the total communication is 2P.

P_os+g+p: In this stage, only 1/N of the P model parameters are stored on each device, so each device must broadcast its P/N parameter partition to all N devices for both the forward and the backward pass, i.e., communication of P/N × N = P per pass, or 2P in total. The reduce-scatter on the gradients costs another P, as above. The total communication is therefore 3P, i.e., 1.5× that of classic DP. This communication is spread out so that parameters are only present on a device when they are needed and are discarded right after, preserving the memory savings discussed above.
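Putting those counts together, a back-of-the-envelope summary (counting only gradient and parameter traffic per training step, for the modes discussed above):

```python
def comm_volume_per_step(P: float, mode: str) -> float:
    """Approximate per-step communication volume, in units of model parameters."""
    if mode in ("classic_dp", "P_os+g"):
        return 2 * P    # gradient reduce-scatter (P) + all-gather (P)
    if mode == "P_os+g+p":
        return 3 * P    # parameter all-gathers for fwd + bwd (2P) + gradient reduce-scatter (P)
    raise ValueError(f"unknown mode: {mode}")
```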

Conclusion

DeepSpeed provides a sophisticated distributed training strategy that reduces memory redundancy compared to classic DP while incurring a similar amount of communication. The library is well designed, allowing model developers to use it without having to understand the mechanics described above. If you want to know anyway, hopefully this article gives you an overview of what’s happening under the hood. Follow for similar articles.

References:

  1. Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. https://arxiv.org/pdf/1910.02054v3
  2. DeepSpeed: https://www.deepspeed.ai/
  3. NVIDIA NCCL User Guide, Collective Operations: https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/operations.html


Published via Towards AI
