DeepSpeed
DeepSpeed is a library designed for speed and scale for distributed training of large models with billions of parameters. At its core is the Zero Redundancy Optimizer (ZeRO), which shards optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across data parallel processes. This drastically reduces memory usage, allowing you to scale your training to billion-parameter models. To unlock even more memory efficiency, ZeRO-Offload reduces GPU compute and memory by leveraging CPU resources during optimization.
Both of these features are supported in 🤗 Accelerate, and you can use them with 🤗 PEFT. This guide will help you learn how to use our DeepSpeed training script. You'll configure the script to train a large model for conditional generation with ZeRO-3 and ZeRO-Offload.
💡 To help you get started, check out our example training scripts for causal language modeling and conditional generation. You can adapt these scripts for your own applications or even use them out of the box if your task is similar to the one in the scripts.
Start by running the following command to create a DeepSpeed configuration file with 🤗 Accelerate. The --config_file flag allows you to save the configuration file to a specific location, otherwise it is saved as a default_config.yaml file in the 🤗 Accelerate cache.
The configuration file is used to set the default options when you launch the training script.
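For example, using the ds_zero3_cpu.yaml file name that the rest of this guide refers to:

```bash
accelerate config --config_file ds_zero3_cpu.yaml
```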
You'll be asked a few questions about your setup and will configure the following arguments. In this example, you'll use ZeRO-3 and ZeRO-Offload, so make sure you pick those options.
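The DeepSpeed-related arguments roughly correspond to the following fields in the generated configuration file (descriptions paraphrased; the exact prompt wording depends on your Accelerate version):

```
zero_stage: [0] disabled, [1] optimizer state partitioning, [2] optimizer + gradient partitioning,
            [3] optimizer + gradient + parameter partitioning
gradient_accumulation_steps: number of training steps to accumulate gradients over before applying them
gradient_clipping: enable gradient clipping with the given value
offload_optimizer_device: [none] disabled, [cpu] offload optimizer states to CPU, [nvme] offload to NVMe SSD (ZeRO Stage-2 and above)
offload_param_device: [none] disabled, [cpu] offload parameters to CPU, [nvme] offload to NVMe SSD (ZeRO Stage-3 only)
zero3_init_flag: whether to use deepspeed.zero.Init when constructing very large models (ZeRO Stage-3 only)
zero3_save_16bit_model: whether to save 16-bit model weights when using ZeRO Stage-3
mixed_precision: no for FP32 training, fp16 for FP16 mixed precision, bf16 for BF16 mixed precision
```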
An example might look like the following. The most important thing to notice is that zero_stage is set to 3, and offload_optimizer_device and offload_param_device are set to cpu.
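A generated file for this setup looks roughly like this (field order and some defaults vary across Accelerate versions):

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
use_cpu: false
```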
Let's dive a little deeper into the script so you can see what's going on, and understand how it works.

Within the main function, the script creates an Accelerator class to initialize all the necessary requirements for distributed training.
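At its simplest this is just the following (a minimal sketch; the script may pass extra arguments to Accelerator):

```python
from accelerate import Accelerator

accelerator = Accelerator()
```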
💡 Feel free to change the model and dataset inside the main function. If your dataset format is different from the one in the script, you may also need to write your own preprocessing function.
The script also creates a configuration for the 🤗 PEFT method you're using, which in this case is LoRA. The LoraConfig specifies the task type and important parameters such as the dimension of the low-rank matrices, the scaling factor for those matrices, and the dropout probability of the LoRA layers. If you want to use a different 🤗 PEFT method, make sure you replace LoraConfig with the appropriate configuration class.
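For a conditional generation (seq2seq) task the configuration looks something like this; the hyperparameter values below are illustrative:

```python
from peft import LoraConfig, TaskType

peft_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # conditional generation task
    inference_mode=False,
    r=8,               # dimension of the low-rank matrices
    lora_alpha=32,     # scaling factor for the low-rank matrices
    lora_dropout=0.1,  # dropout probability of the LoRA layers
)
```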
Throughout the script, you'll see the main_process_first and wait_for_everyone functions, which help control and synchronize when processes are executed.
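They are typically used like this (a sketch; dataset and preprocess_function stand in for whatever the script defines):

```python
# run preprocessing on the main process first so the cached result
# can be reused by the other processes
with accelerator.main_process_first():
    processed_datasets = dataset.map(preprocess_function, batched=True)

# block until every process reaches this point
accelerator.wait_for_everyone()
```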
The get_peft_model() function takes a base model and the peft_config you prepared earlier to create a PeftModel:
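For example, with model being the base transformer loaded earlier in the script:

```python
from peft import get_peft_model

model = get_peft_model(model, peft_config)  # wraps the base model with the LoRA adapters
```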
Pass all the relevant training objects to 🤗 Accelerate's prepare method, which makes sure everything is ready for training:
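Something along these lines (the exact set of objects depends on the script):

```python
model, train_dataloader, eval_dataloader, optimizer, lr_scheduler = accelerator.prepare(
    model, train_dataloader, eval_dataloader, optimizer, lr_scheduler
)
```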
The next bit of code checks whether the DeepSpeed plugin is used in the Accelerator, and if the plugin exists, then the Accelerator uses ZeRO-3 as specified in the configuration file:
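A check along these lines (the is_ds_zero_3 name is how the PEFT example scripts store the result; treat the snippet as a sketch):

```python
is_ds_zero_3 = False
if getattr(accelerator.state, "deepspeed_plugin", None):
    is_ds_zero_3 = accelerator.state.deepspeed_plugin.zero_stage == 3
```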
Inside the training loop, the usual loss.backward() is replaced by 🤗 Accelerate's backward method, which uses the correct backward() method based on your configuration:
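A typical training step then looks something like this (a sketch; batch, optimizer, and lr_scheduler come from the surrounding loop):

```python
outputs = model(**batch)
loss = outputs.loss
accelerator.backward(loss)  # instead of loss.backward()
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
```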
That is all! The rest of the script handles the training loop, evaluation, and even pushes it to the Hub for you.
Run the following command to launch the training script. Earlier, you saved the configuration file to ds_zero3_cpu.yaml, so you'll need to pass that path to the launcher with the --config_file argument like this:
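For example (replace the script path with the location of the DeepSpeed example script you are using, such as peft_lora_seq2seq_accelerate_ds_zero3_offload.py from the PEFT examples):

```bash
accelerate launch --config_file ds_zero3_cpu.yaml peft_lora_seq2seq_accelerate_ds_zero3_offload.py
```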
You'll see some output logs that track memory usage during training, and once training is complete, the script returns the accuracy and compares the predictions to the labels.