DeepSpeed

DeepSpeed is a library designed for speed and scale for distributed training of large models with billions of parameters. At its core is the Zero Redundancy Optimizer (ZeRO) that shards optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across data parallel processes. This drastically reduces memory usage, allowing you to scale your training to billion parameter models. To unlock even more memory efficiency, ZeRO-Offload reduces GPU compute and memory by leveraging CPU resources during optimization.

Both of these features are supported in 🌍 Accelerate, and you can use them with 🌍 PEFT. This guide will help you learn how to use our DeepSpeed training script. You’ll configure the script to train a large model for conditional generation with ZeRO-3 and ZeRO-Offload.

💡 To help you get started, check out our example training scripts for causal language modeling and conditional generation. You can adapt these scripts for your own applications or even use them out of the box if your task is similar to the one in the scripts.

Configuration

Start by running the following command to create a DeepSpeed configuration file with 🌍 Accelerate. The --config_file flag allows you to save the configuration file to a specific location; otherwise, it is saved as a default_config.yaml file in the 🌍 Accelerate cache.

The configuration file is used to set the default options when you launch the training script.

```bash
accelerate config --config_file ds_zero3_cpu.yaml
```

You’ll be asked a few questions about your setup and to configure the following arguments. In this example, you’ll use ZeRO-3 and ZeRO-Offload, so make sure you pick those options.

`zero_stage`: [0] Disabled, [1] optimizer state partitioning, [2] optimizer+gradient state partitioning and [3] optimizer+gradient+parameter partitioning
`gradient_accumulation_steps`: Number of training steps to accumulate gradients before averaging and applying them.
`gradient_clipping`: Enable gradient clipping with value.
`offload_optimizer_device`: [none] Disable optimizer offloading, [cpu] offload optimizer to CPU, [nvme] offload optimizer to NVMe SSD. Only applicable with ZeRO >= Stage-2.
`offload_param_device`: [none] Disable parameter offloading, [cpu] offload parameters to CPU, [nvme] offload parameters to NVMe SSD. Only applicable with ZeRO Stage-3.
`zero3_init_flag`: Decides whether to enable `deepspeed.zero.Init` for constructing massive models. Only applicable with ZeRO Stage-3.
`zero3_save_16bit_model`: Decides whether to save 16-bit model weights when using ZeRO Stage-3.
`mixed_precision`: `no` for FP32 training, `fp16` for FP16 mixed-precision training and `bf16` for BF16 mixed-precision training. 

An example configuration file might look like the following. The most important thing to notice is that `zero_stage` is set to `3`, and `offload_optimizer_device` and `offload_param_device` are set to `cpu`.

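The file below is a representative sketch of what `accelerate config` might write for this setup; the exact fields and default values depend on your 🌍 Accelerate and DeepSpeed versions and on the answers you gave, so treat it as illustrative rather than a file to copy verbatim.

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
use_cpu: false
```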

The important parts

Let’s dive a little deeper into the script so you can see what’s going on, and understand how it works.

Within the main function, the script creates an Accelerator class to initialize all the necessary requirements for distributed training.
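A minimal sketch of that initialization (the example script may pass additional arguments):

```python
from accelerate import Accelerator

# When launched via `accelerate launch --config_file ds_zero3_cpu.yaml`,
# the Accelerator picks up the DeepSpeed ZeRO-3 and offload settings automatically.
accelerator = Accelerator()
```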

💡 Feel free to change the model and dataset inside the main function. If your dataset format is different from the one in the script, you may also need to write your own preprocessing function.

The script also creates a configuration for the 🌍 PEFT method you’re using, which in this case is LoRA. The LoraConfig specifies the task type and important parameters such as the dimension of the low-rank matrices, their scaling factor, and the dropout probability of the LoRA layers. If you want to use a different 🌍 PEFT method, make sure you replace LoraConfig with the appropriate class.

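As a sketch for a conditional generation (sequence-to-sequence) task, the configuration could look like this; the values for r, lora_alpha, and lora_dropout are illustrative, not required:

```python
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # conditional generation task
    inference_mode=False,             # training, not inference
    r=8,                              # dimension of the low-rank matrices
    lora_alpha=32,                    # scaling factor for the matrices
    lora_dropout=0.1,                 # dropout probability of the LoRA layers
)
```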

Throughout the script, you’ll see the main_process_first and wait_for_everyone functions which help control and synchronize when processes are executed.
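For illustration, this is the kind of pattern you’ll see; dataset and preprocess_function below stand in for whatever data pipeline the script uses:

```python
# Run preprocessing on the main process first so the other processes
# can reuse the cached result instead of repeating the work.
with accelerator.main_process_first():
    processed_datasets = dataset.map(preprocess_function, batched=True)

# Block here until every process has caught up.
accelerator.wait_for_everyone()
```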

The get_peft_model() function takes a base model and the peft_config you prepared earlier to create a PeftModel:

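A minimal sketch of that step:

```python
from peft import get_peft_model

model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # show how few parameters LoRA actually trains
```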

Pass all the relevant training objects to 🌍 Accelerate’s prepare which makes sure everything is ready for training:

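A sketch, assuming the model, dataloaders, optimizer, and learning-rate scheduler have already been built:

```python
model, train_dataloader, eval_dataloader, optimizer, lr_scheduler = accelerator.prepare(
    model, train_dataloader, eval_dataloader, optimizer, lr_scheduler
)
```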

The next bit of code checks whether the DeepSpeed plugin is enabled in the Accelerator and, if it is, whether ZeRO-3 is being used as specified in the configuration file:

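One way to express that check (a sketch; the example script’s exact code may differ):

```python
is_ds_zero_3 = False
if getattr(accelerator.state, "deepspeed_plugin", None):
    is_ds_zero_3 = accelerator.state.deepspeed_plugin.zero_stage == 3
```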

Inside the training loop, the usual loss.backward() is replaced by 🌍 Accelerate’s backward which uses the correct backward() method based on your configuration:

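A condensed sketch of such a loop:

```python
model.train()
for step, batch in enumerate(train_dataloader):
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)  # replaces the usual loss.backward()
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
```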

That is all! The rest of the script handles the training loop and evaluation, and even pushes the model to the Hub for you.

Train

Run the following command to launch the training script. Earlier, you saved the configuration file to ds_zero3_cpu.yaml, so you’ll need to pass the path to the launcher with the --config_file argument like this:

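For example, with your_training_script.py as a placeholder for the conditional generation script you’re running:

```bash
accelerate launch --config_file ds_zero3_cpu.yaml your_training_script.py
```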

You’ll see some output logs that track memory usage during training, and once it’s completed, the script reports the accuracy and compares the predictions to the labels.

