
Distributed training

When training on a single CPU is too slow, we can use multiple CPUs. This guide focuses on PyTorch-based DistributedDataParallel (DDP), which enables efficient distributed training on CPUs.

Distributed training on multiple CPUs is launched with mpirun, which supports both Gloo and oneCCL as collective communication backends. For best performance, Intel recommends the oneCCL backend.

Intel® oneCCL (collective communications library) is a library for efficient distributed deep learning training that implements collectives such as allreduce, allgather, and alltoall. For more information on oneCCL, please refer to the oneCCL documentation and the oneCCL specification.

The oneccl_bindings_for_pytorch module (called torch_ccl before version 1.12) implements the PyTorch C10D ProcessGroup API; it can be dynamically loaded as an external ProcessGroup and currently only works on the Linux platform.

See oneccl_bind_pt for more detailed information.
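To see how the bindings fit together with torch.distributed, here is a minimal, illustrative sketch. It assumes oneccl_bindings_for_pytorch is installed and the script is launched with mpirun as shown later in this guide; the PMI_* variable mapping and the MASTER_ADDR/MASTER_PORT defaults are illustrative choices, not part of the library:

import os

import torch
import torch.distributed as dist
import oneccl_bindings_for_pytorch  # importing registers the "ccl" backend

# mpirun exposes the rank and world size through PMI_* variables; map them
# to the variables torch.distributed expects. MASTER_ADDR/MASTER_PORT below
# are illustrative defaults for a single-node run.
os.environ["RANK"] = str(os.environ.get("PMI_RANK", 0))
os.environ["WORLD_SIZE"] = str(os.environ.get("PMI_SIZE", 1))
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="ccl", init_method="env://")

# A toy allreduce, one of the collectives mentioned above: every rank
# contributes a tensor of ones and receives the element-wise sum.
t = torch.ones(2)
dist.all_reduce(t)  # the default reduction op is SUM
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: {t}")

dist.destroy_process_group()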

The following steps show how to set up and run distributed training with the oneCCL backend.

Intel® oneCCL Bindings for PyTorch installation:

Wheel files are available for the following Python versions:

| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 |
| :---------------: | :--------: | :--------: | :--------: | :--------: | :---------: |
| 1.12.1            |            | √          | √          | √          | √           |
| 1.12.0            |            | √          | √          | √          | √           |
| 1.11.0            |            | √          | √          | √          | √           |
| 1.10.0            | √          | √          | √          | √          |             |


pip install oneccl_bind_pt=={pytorch_version} -f https://software.intel.com/ipex-whl-stable

where {pytorch_version} should be your PyTorch version, for instance 1.12.0. The versions of oneCCL and PyTorch must match: the oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is built for PyTorch 1.12.0), while PyTorch 1.12.1 should be used with oneccl_bindings_for_pytorch 1.12.1.
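The matching version can be checked from Python before picking a wheel, for example:

import torch

# The wheel version passed to pip (e.g. 1.12.0) must match this exactly.
print(torch.__version__)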

MPI tool set for Intel® oneCCL >= 1.12.0


oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh

for Intel® oneCCL whose version < 1.12.0


torch_ccl_path=$(python -c "import torch; import torch_ccl; import os;  print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))")
source $torch_ccl_path/env/setvars.sh

The following command enables training with 2 processes on one node, with one process running per socket. The variables OMP_NUM_THREADS and CCL_WORKER_COUNT can be tuned for optimal performance.


export CCL_WORKER_COUNT=1
export MASTER_ADDR=127.0.0.1
mpirun -n 2 -genv OMP_NUM_THREADS=23 \
python3 run_qa.py \
    --model_name_or_path distilbert-base-uncased-distilled-squad \
    --dataset_name squad \
    --apply_quantization \
    --quantization_approach static \
    --do_train \
    --do_eval \
    --verify_loading \
    --output_dir /tmp/squad_output \
    --no_cuda \
    --xpu_backend ccl

The following command enables training with a total of four processes on two nodes (node0 and node1, with node0 as the main process). ppn (processes per node) is set to 2, with one process running per socket. The variables OMP_NUM_THREADS and CCL_WORKER_COUNT can be tuned for optimal performance.

On node0, you need to create a configuration file that contains the IP addresses of each node (for example hostfile) and pass that configuration file's path as an argument.


 cat hostfile
 xxx.xxx.xxx.xxx #node0 ip
 xxx.xxx.xxx.xxx #node1 ip

Now, run the following command on node0, and 4-process DDP will be enabled across node0 and node1:


export CCL_WORKER_COUNT=1
export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
mpirun -f hostfile -n 4 -ppn 2 \
-genv OMP_NUM_THREADS=23 \
python3 run_qa.py \
    --model_name_or_path distilbert-base-uncased-distilled-squad \
    --dataset_name squad \
    --apply_quantization \
    --quantization_approach static \
    --do_train \
    --do_eval \
    --verify_loading \
    --output_dir /tmp/squad_output \
    --no_cuda \
    --xpu_backend ccl