
How to use DeepSpeed

DeepSpeed for HPUs

DeepSpeed enables you to fit and train larger models on HPUs thanks to various optimizations described in the ZeRO paper. In particular, you can use the two following ZeRO configurations, which have been validated to be fully functioning with Gaudi:

  • ZeRO-1: partitions the optimizer states across processes.

  • ZeRO-2: partitions the optimizer states + gradients across processes.

These configurations are fully compatible with Habana Mixed Precision and can thus be used to train your model in bf16 precision.
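
To make the difference concrete, here is a minimal sketch (not taken from the official examples) showing that the two stages differ only in the stage field of the zero_optimization section of a DeepSpeed configuration:

import json

def make_zero_config(stage: int) -> dict:
    # Stage 1 partitions optimizer states; stage 2 additionally partitions gradients.
    return {
        "train_batch_size": "auto",
        "bf16": {"enabled": True},  # train in bf16 precision
        "zero_optimization": {"stage": stage},
    }

# Write a ZeRO-2 configuration to disk so it can be passed to the trainer later.
with open("zero2_bf16.json", "w") as f:
    json.dump(make_zero_config(2), f, indent=4)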

You can find more information about the DeepSpeed Gaudi integration here.

Setup

To use DeepSpeed on Gaudi, you need to install Optimum Habana and Habana's DeepSpeed fork with:

pip install optimum[habana]
pip install git+https://github.com/HabanaAI/DeepSpeed.git@1.12.0
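
If you want to double-check that both packages ended up in your environment, an optional sanity check with the standard library could look like this (the distribution names optimum-habana and deepspeed are assumptions about how the packages are published):

from importlib import metadata

# Optional sanity check: print the installed versions, or report missing packages.
for dist in ("optimum-habana", "deepspeed"):
    try:
        print(dist, metadata.version(dist))
    except metadata.PackageNotFoundError:
        print(dist, "is not installed")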

Using DeepSpeed with Optimum Habana

The GaudiTrainer allows using DeepSpeed as easily as the Transformers Trainer. This can be done in 3 steps:

  1. A DeepSpeed configuration has to be defined.

  2. The deepspeed training argument is used to specify the path to the DeepSpeed configuration.

  3. The deepspeed launcher must be used to run your script.

These steps are detailed below. A comprehensive guide about how to use DeepSpeed with the Transformers Trainer is also available here.

DeepSpeed configuration

The DeepSpeed configuration is passed through a JSON file and lets you choose which optimizations to apply. Here is an example for applying ZeRO-2 optimizations and bf16 precision:

{
    "steps_per_print": 64,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "bf16": {
        "enabled": true
    },
    "gradient_clipping": 1.0,
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": false,
        "reduce_scatter": false,
        "contiguous_gradients": false
    }
}

The special value "auto" enables the correct or most efficient value to be picked automatically. You can also specify the values yourself but, if you do so, be careful not to set values that conflict with your training arguments. It is strongly advised to read this section of the Transformers documentation to completely understand how this works.

Other examples of configurations for HPUs are proposed here by Habana.

The Transformers documentation explains very well how to write a configuration from scratch. A more complete description of all configuration possibilities is available here.

The deepspeed training argument

To use DeepSpeed, you must specify deepspeed=path_to_my_deepspeed_configuration in your GaudiTrainingArguments instance:

training_args = GaudiTrainingArguments(
    # my usual training arguments...
    use_habana=True,
    use_lazy_mode=True,
    gaudi_config_name=path_to_my_gaudi_config,
    deepspeed=path_to_my_deepspeed_config,
)

This argument both indicates that DeepSpeed should be used and points to your DeepSpeed configuration.
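
For context, here is a hypothetical sketch of how these training arguments could then be passed to a GaudiTrainer (the model and dataset are placeholders you would replace with your own); nothing DeepSpeed-specific is needed beyond the argument above:

from optimum.habana import GaudiTrainer

trainer = GaudiTrainer(
    model=model,                  # placeholder: a transformers PreTrainedModel
    args=training_args,           # the GaudiTrainingArguments defined above
    train_dataset=train_dataset,  # placeholder: your tokenized training set
)
trainer.train()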

Launching your script

Finally, there are two possible ways to launch your script:

1. Using the gaudi_spawn.py script:

python gaudi_spawn.py \
    --world_size number_of_hpu_you_have --use_deepspeed \
    path_to_script.py --args1 --args2 ... --argsN \
    --deepspeed path_to_deepspeed_config

where --argX is an argument of the script to run with DeepSpeed.

2. Using the DistributedRunner directly in code:

from optimum.habana.distributed import DistributedRunner
from optimum.utils import logging

world_size=8 # Number of HPUs to use (1 or 8)

# define distributed runner
distributed_runner = DistributedRunner(
    command_list=["scripts/train.py --args1 --args2 ... --argsN --deepspeed path_to_deepspeed_config"],
    world_size=world_size,
    use_deepspeed=True,
)

# start job
ret_code = distributed_runner.run()

You should set "use_fused_adam": false in your Gaudi configuration because it is not compatible with DeepSpeed yet.
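
As a reminder of what that looks like, a minimal Gaudi configuration file with the fused Adam optimizer disabled could be written as follows (a sketch; any other GaudiConfig fields are assumed to keep their default values):

import json

# Sketch of a minimal Gaudi configuration disabling fused Adam for DeepSpeed runs.
gaudi_config = {
    "use_fused_adam": False,     # required when training with DeepSpeed
    "use_fused_clip_norm": True,
}

with open("gaudi_config.json", "w") as f:
    json.dump(gaudi_config, f, indent=4)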

