How to use DeepSpeed
DeepSpeed implements everything described in the ZeRO paper. Currently, it provides full support for:
Optimizer state partitioning (ZeRO stage 1)
Gradient partitioning (ZeRO stage 2)
Parameter partitioning (ZeRO stage 3)
Custom mixed precision training handling
A range of fast CUDA-extension-based optimizers
ZeRO-Offload to CPU and Disk/NVMe
ZeRO-Offload has its own dedicated paper: ZeRO-Offload: Democratizing Billion-Scale Model Training. And NVMe-support is described in the paper ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning.
DeepSpeed ZeRO-2 is primarily used only for training, as its features are of no use for inference.
DeepSpeed ZeRO-3 can be used for inference as well, since it allows huge models to be loaded across multiple GPUs, which wouldn't be possible on a single GPU.
🌍 Accelerate integrates DeepSpeed via 2 options:

1. Integration of the DeepSpeed features via the `deepspeed` config file specification in `accelerate config`. You just supply your custom config file or use our template. Most of this document is focused on this feature. This supports all the core features of DeepSpeed and gives the user a lot of flexibility. The user may have to change a few lines of code depending on the config.

2. Integration via `deepspeed_plugin`. This supports a subset of the DeepSpeed features and uses default options for the rest of the configurations. The user need not change any code, and this is good for those who are fine with most of the default settings of DeepSpeed.
What is integrated?
Training:
DeepSpeed ZeRO training supports the full ZeRO stages 1, 2 and 3, as well as CPU/Disk offload of optimizer states, gradients and parameters. Below is a short description of Data Parallelism using ZeRO (Zero Redundancy Optimizer), along with a diagram from this blog post.

a. Stage 1 : Shards optimizer states across data parallel workers/GPUs
b. Stage 2 : Shards optimizer states + gradients across data parallel workers/GPUs
c. Stage 3: Shards optimizer states + gradients + model parameters across data parallel workers/GPUs
d. Optimizer Offload: Offloads the gradients + optimizer states to CPU/Disk, building on top of ZeRO Stage 2
e. Param Offload: Offloads the model parameters to CPU/Disk, building on top of ZeRO Stage 3
Note: With respect to Disk Offload, the disk should be an NVMe for decent speed, but it technically works on any disk.
Inference:
DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but it doesn't use an optimizer or an LR scheduler; only stage 3 is relevant. For more details see: deepspeed-zero-inference.
How does it work?
Pre-Requisites: Install DeepSpeed version >=0.6.5. Please refer to the DeepSpeed Installation details for more information.
We will first look at the easy-to-use integration via `accelerate config`, followed by the more flexible and feature-rich `deepspeed` config file integration.
Accelerate DeepSpeed Plugin
On your machine(s) just run:
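The command block was lost during extraction; what belongs here is Accelerate's interactive configuration wizard (the same command referenced again in the scenarios below):

```shell
accelerate config
```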
and answer the questions asked. It will ask whether you want to use a config file for DeepSpeed, to which you should answer no. Then answer the following questions to generate a basic DeepSpeed config. This will generate a config file that will be used automatically to properly set the default options when launching your training script.
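In sketch form, the launch step looks like this (the script name and arguments are placeholders for your own):

```shell
accelerate launch my_script.py --args_to_my_script
```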
For instance, here is how you would run the NLP example examples/nlp_example.py (from the root of the repo) with DeepSpeed Plugin:
ZeRO Stage-2 DeepSpeed Plugin Example
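The original example block is missing here; the following is a hedged sketch of what a ZeRO Stage-2 plugin configuration produced by `accelerate config` might look like (field names follow the options exercised elsewhere in this document; your values will differ):

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
mixed_precision: fp16
num_machines: 1
num_processes: 2
```

With such a config in place, running `accelerate launch examples/nlp_example.py --mixed_precision fp16` would train the NLP example under ZeRO Stage-2.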
ZeRO Stage-3 with CPU Offload DeepSpeed Plugin Example
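Again the example block is missing; a hedged sketch of the Stage-3 variant with CPU offload enabled would only differ in the offload and stage fields:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
mixed_precision: fp16
num_machines: 1
num_processes: 2
```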
Currently, Accelerate supports a limited set of DeepSpeed config options through the CLI.
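The original list is missing; judging from the launch arguments exercised later in this document, the CLI-configurable options include at least the following (a sketch, not an exhaustive reference):

```yaml
deepspeed_config:
  zero_stage: ...                  # 0, 1, 2 or 3
  gradient_accumulation_steps: ...
  gradient_clipping: ...
  offload_optimizer_device: ...    # none | cpu | nvme
  offload_param_device: ...        # none | cpu | nvme
  zero3_init_flag: ...             # whether to initialize huge models directly sharded
  zero3_save_16bit_model: ...      # whether to gather 16-bit weights on save
```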
To be able to tweak more options, you will need to use a DeepSpeed config file.
DeepSpeed Config File
On your machine(s) just run:
and answer the questions asked. It will ask whether you want to use a config file for DeepSpeed, to which you should answer yes and provide the path to the DeepSpeed config file. This will generate a config file that will be used automatically to properly set the default options when launching your training script.
For instance, here is how you would run the NLP example examples/by_feature/deepspeed_with_config_support.py (from the root of the repo) with DeepSpeed Config File:
ZeRO Stage-2 DeepSpeed Config File Example
with the contents of zero_stage2_config.json being:
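The JSON contents were lost in extraction; below is a hedged sketch of a Stage-2 config with `optimizer`/`scheduler` sections and the `auto` values that the rest of this document refers to (the exact fields may differ from the original example):

```json
{
    "fp16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
```

Such a file would be referenced from the accelerate config, and the example launched with `accelerate launch examples/by_feature/deepspeed_with_config_support.py` plus the script's own arguments.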
ZeRO Stage-3 with CPU offload DeepSpeed Config File Example
with the contents of zero_stage3_offload_config.json being:
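The contents are missing here as well; a hedged sketch of a Stage-3 config with CPU offload, differing from the Stage-2 sketch mainly in the `zero_optimization` section, would be:

```json
{
    "fp16": {
        "enabled": true
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "total_num_steps": "auto"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "stage3_gather_16bit_weights_on_model_save": true
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto"
}
```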
Important code changes when using DeepSpeed Config File
DeepSpeed Optimizers and Schedulers. For more information on these, see the DeepSpeed Optimizers and DeepSpeed Schedulers documentation. We will look at the changes needed in the code when using these.
a. DS Optim + DS Scheduler: The case when both `optimizer` and `scheduler` keys are present in the DeepSpeed config file. In this situation, those will be used and the user has to use `accelerate.utils.DummyOptim` and `accelerate.utils.DummyScheduler` to replace the PyTorch/custom optimizers and schedulers in their code. The snippet from `examples/by_feature/deepspeed_with_config_support.py` demonstrates this.
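The referenced snippet did not survive extraction; the following sketch shows the intended pattern (names like `model`, `train_dataloader` and `accelerator` are assumed to already exist in your script):

```python
from accelerate.utils import DummyOptim, DummyScheduler

# Placeholders: the real optimizer and scheduler settings come from the
# `optimizer` and `scheduler` sections of the DeepSpeed config file.
optimizer = DummyOptim(params=model.parameters())
lr_scheduler = DummyScheduler(optimizer)

model, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
    model, optimizer, train_dataloader, lr_scheduler
)
```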
b. Custom Optim + Custom Scheduler: The case when both `optimizer` and `scheduler` keys are absent in the DeepSpeed config file. In this situation, no code changes are needed from the user, and this is the case when using integration via the DeepSpeed Plugin. In the above example we can see that the code remains unchanged if the `optimizer` and `scheduler` keys are absent in the DeepSpeed config file.

c. Custom Optim + DS Scheduler: The case when only the `scheduler` key is present in the DeepSpeed config file. In this situation, the user has to use `accelerate.utils.DummyScheduler` to replace the PyTorch/custom scheduler in their code.

d. DS Optim + Custom Scheduler: The case when only the `optimizer` key is present in the DeepSpeed config file. This will result in an error, because you can only use a DS Scheduler when using a DS Optim.

Notice the `auto` values in the above example DeepSpeed config files. These are automatically handled by the `prepare` method based on the model, dataloaders, dummy optimizer and dummy schedulers provided to the `prepare` method. Only the `auto` fields specified in the above examples are handled by the `prepare` method; the rest have to be explicitly specified by the user.
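Conceptually, this `auto` handling amounts to substituting runtime-known values into the config before handing it to DeepSpeed. The helper below, `resolve_auto`, is a hypothetical illustration of that idea only, not an Accelerate API:

```python
def resolve_auto(ds_config, runtime_values):
    """Recursively replace "auto" entries with values known at prepare() time.
    Hypothetical sketch; Accelerate performs this substitution internally."""
    resolved = {}
    for key, value in ds_config.items():
        if isinstance(value, dict):
            resolved[key] = resolve_auto(value, runtime_values)
        elif value == "auto" and key in runtime_values:
            resolved[key] = runtime_values[key]
        else:
            resolved[key] = value
    return resolved

config = {
    "gradient_clipping": "auto",
    "zero_optimization": {"stage": 3, "stage3_param_persistence_threshold": "auto"},
}
resolved = resolve_auto(
    config,
    {"gradient_clipping": 1.0, "stage3_param_persistence_threshold": 10000},
)
```

Fields without a matching runtime value (or with concrete values already) are passed through untouched, which is why the remaining `auto` entries must be resolvable at `prepare()` time.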
Things to note when using DeepSpeed Config File
Below is a sample script using `deepspeed_config_file` in different scenarios.
Code test.py:
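The script body is missing; a minimal sketch consistent with the scenarios below is a script that just creates an `Accelerator` and prints the resulting distributed state:

```python
from accelerate import Accelerator
from accelerate.state import AcceleratorState

def main():
    accelerator = Accelerator()
    # Printing the state makes it easy to see which DeepSpeed options took effect.
    accelerator.print(f"{AcceleratorState()}")

if __name__ == "__main__":
    main()
```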
Scenario 1: A manually tampered accelerate config file having `deepspeed_config_file` along with other entries.
Content of the `accelerate` config:
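The config contents are missing; a hedged sketch of such a "tampered" file, with `deepspeed_config_file` alongside entries that duplicate what the JSON file already specifies, might look like:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: ds_config.json
  zero3_init_flag: true
  # The entries below duplicate settings in ds_config.json and cause the ambiguity:
  zero_stage: 3
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
distributed_type: DEEPSPEED
mixed_precision: fp16
```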
`ds_config.json`:
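The JSON contents are also missing; for the scenario to make sense, `ds_config.json` would carry concrete values for the same settings, e.g. (a sketch):

```json
{
    "fp16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu"
        }
    },
    "gradient_accumulation_steps": 1,
    "gradient_clipping": 1.0
}
```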
Running `accelerate launch test.py` raises an error pointing out the ambiguity between the entries in the accelerate config file and those in `ds_config.json`.
Scenario 2: Use the solution suggested by the error to create a new accelerate config and check that no ambiguity error is now thrown.
Run `accelerate config` and answer the prompts.
Content of the `accelerate` config:
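Following the error's suggestion, the accelerate config would keep only the `deepspeed_config_file` entry and leave everything else to the JSON file (a sketch):

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: ds_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
```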
Running `accelerate launch test.py` now completes without the ambiguity error.
Scenario 3: Setting the `accelerate launch` command arguments related to DeepSpeed as `"auto"` in the DeepSpeed configuration file and checking that things work as expected.
New `ds_config.json` with `"auto"` for the `accelerate launch` DeepSpeed command arguments:
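The file contents are missing; matching the launch arguments this scenario exercises, a sketch of the `"auto"`-valued config would be:

```json
{
    "fp16": {
        "enabled": "auto"
    },
    "zero_optimization": {
        "stage": "auto",
        "stage3_gather_16bit_weights_on_model_save": "auto",
        "offload_optimizer": {
            "device": "auto"
        },
        "offload_param": {
            "device": "auto"
        }
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto"
}
```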
Running `accelerate launch --mixed_precision="fp16" --zero_stage=3 --gradient_accumulation_steps=5 --gradient_clipping=1.0 --offload_param_device="cpu" --offload_optimizer_device="nvme" --zero3_save_16bit_model="true" test.py` fills in the `"auto"` entries from the command-line arguments, and things work as expected.
Note:
1. Remaining `"auto"` values are handled in the `accelerator.prepare()` call as explained in point 2 of "Important code changes when using DeepSpeed Config File".
2. Only when `gradient_accumulation_steps` is `auto` will the value passed while creating the `Accelerator` object via `Accelerator(gradient_accumulation_steps=k)` be used. When using the DeepSpeed Plugin, the value from it will be used and it will overwrite the value passed while creating the `Accelerator` object.
Saving and loading
Saving and loading of models is unchanged for ZeRO Stage-1 and Stage-2.
Under ZeRO Stage-3, `state_dict` contains just the placeholders since the model weights are partitioned across multiple GPUs. ZeRO Stage-3 has 2 options:

a. Saving the entire 16-bit model weights to directly load later on using `model.load_state_dict(torch.load(pytorch_model.bin))`. For this, either set `zero_optimization.stage3_gather_16bit_weights_on_model_save` to True in the DeepSpeed config file or set `zero3_save_16bit_model` to True in the DeepSpeed Plugin. Note that this option requires consolidation of the weights on one GPU; it can be slow and memory demanding, so only use this feature when needed. The snippet from `examples/by_feature/deepspeed_with_config_support.py` demonstrates this.
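The referenced snippet is missing; a sketch of the pattern, assuming an existing `accelerator` and a `model` returned by `prepare`, is:

```python
# get_state_dict consolidates the full 16-bit weights on the main process
# (requires stage3_gather_16bit_weights_on_model_save / zero3_save_16bit_model).
state_dict = accelerator.get_state_dict(model)
accelerator.save(state_dict, "pytorch_model.bin")
```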
b. To get 32-bit weights, first save the model using `model.save_checkpoint()`. The snippet from `examples/by_feature/deepspeed_with_config_support.py` demonstrates this.
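That snippet is also missing; in sketch form (the `model` returned by `prepare` is the DeepSpeed engine, so it exposes `save_checkpoint`):

```python
checkpoint_dir = "ckpt"  # any directory path
# Writes the sharded ZeRO partitions plus the zero_to_fp32.py helper script.
success = model.save_checkpoint(checkpoint_dir)
```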
This will create the ZeRO model and optimizer partitions along with the `zero_to_fp32.py` script in the checkpoint directory. You can use this script to do offline consolidation; it requires no configuration files or GPUs.
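An example of its usage, run from inside the checkpoint directory (paths are placeholders):

```shell
cd checkpoint_dir
python zero_to_fp32.py . pytorch_model.bin
```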
Alternatively, to get the 32-bit model for saving/inference, you can use DeepSpeed's checkpoint utilities directly in code.
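In sketch form, assuming `accelerator`, `model` and a `checkpoint_dir` from the previous step:

```python
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

unwrapped_model = accelerator.unwrap_model(model)
fp32_model = load_state_dict_from_zero_checkpoint(unwrapped_model, checkpoint_dir)
```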
If you are only interested in the `state_dict`, it can be extracted on its own.
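Again in sketch form:

```python
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# Returns the consolidated fp32 state dict on CPU.
state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)
```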
Note that all these functions require roughly 2x the size of the final checkpoint in general RAM.
ZeRO Inference
DeepSpeed ZeRO Inference supports ZeRO stage 3 with ZeRO-Infinity. It uses the same ZeRO protocol as training, but it doesn't use an optimizer or an LR scheduler; only stage 3 is relevant. With the Accelerate integration, you just need to prepare the model and dataloader.
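The example block is missing; the intended pattern, sketched with assumed `accelerator`, `model` and `eval_dataloader` objects, is simply to prepare them without an optimizer or scheduler:

```python
import torch

model, eval_dataloader = accelerator.prepare(model, eval_dataloader)
model.eval()
for batch in eval_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
```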
A few caveats to be aware of:
The current integration doesn't support Pipeline Parallelism of DeepSpeed.
The current integration doesn't support `mpu`, limiting the tensor parallelism which is supported in Megatron-LM.
The current integration doesn't support multiple models.
DeepSpeed Resources
The documentation for the internals related to DeepSpeed can be found here.
Papers:
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
ZeRO-Offload: Democratizing Billion-Scale Model Training
ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning
Finally, please remember that 🌍 Accelerate only integrates DeepSpeed; therefore, if you have any problems or questions with regard to DeepSpeed usage, please file an issue with DeepSpeed GitHub.