Efficient Training on Multiple CPUs
When training on a single CPU is too slow, we can use multiple CPUs. This guide focuses on PyTorch-based Distributed Data Parallel (DDP) to enable distributed CPU training efficiently.
Intel® oneCCL Bindings for PyTorch
Intel® oneCCL (collective communications library) is a library for efficient distributed deep learning training, implementing collectives such as allreduce, allgather, and alltoall. For more information on oneCCL, please refer to the oneCCL documentation and oneCCL specification.
The module oneccl_bindings_for_pytorch (torch_ccl before version 1.12) implements the PyTorch C10D ProcessGroup API and can be dynamically loaded as an external ProcessGroup; it currently works only on Linux.
Check oneccl_bind_pt for more detailed information.
Intel® oneCCL Bindings for PyTorch installation:
Wheel files are available for the following Python versions:
| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 |
| :---------------: | :--------: | :--------: | :--------: | :--------: | :---------: |
| 1.13.0            |            | √          | √          | √          | √           |
| 1.12.100          |            | √          | √          | √          | √           |
| 1.12.0            |            | √          | √          | √          | √           |
| 1.11.0            |            | √          | √          | √          | √           |
| 1.10.0            | √          | √          | √          | √          |             |
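Install the wheel with pip; a typical command (assuming Intel's stable CPU wheel index, as documented for oneccl_bind_pt) looks like:

```bash
pip install oneccl_bind_pt=={pytorch_version} -f https://developer.intel.com/ipex-whl-stable-cpu
```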
where {pytorch_version} should be your PyTorch version, for instance 1.13.0. Check oneccl_bind_pt installation for more approaches. Versions of oneCCL and PyTorch must match.
The oneccl_bindings_for_pytorch 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is built for PyTorch 1.12.0); PyTorch 1.12.1 should be used with oneccl_bindings_for_pytorch 1.12.100.
Intel® MPI library
Use this standards-based MPI implementation to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. This component is part of the Intel® oneAPI HPC Toolkit.
oneccl_bindings_for_pytorch is installed along with the MPI tool set. You need to source the environment before using it.
for Intel® oneCCL >= 1.12.0
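For example, a sketch that locates the installed package and sources the bundled environment script (assuming the package exposes its install path via oneccl_bindings_for_pytorch.cwd, as in recent releases):

```bash
# find the oneccl_bindings_for_pytorch install directory
oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
# source the bundled oneCCL/MPI environment variables
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
```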
for Intel® oneCCL versions earlier than 1.12.0
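A corresponding sketch for the older torch_ccl package layout (the path is derived from the installed module location; an assumption based on how the package ships its setvars.sh):

```bash
# find the torch_ccl install directory (package name before 1.12)
torch_ccl_path=$(python -c "import torch; import torch_ccl; import os; print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))")
# source the bundled oneCCL/MPI environment variables
source $torch_ccl_path/env/setvars.sh
```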
IPEX installation:
IPEX provides performance optimizations for CPU training with both Float32 and BFloat16; you can refer to the single CPU section for details.
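A minimal install sketch, assuming Intel's stable CPU wheel index and that the IPEX version is matched to your PyTorch version (see the single CPU section for the authoritative instructions):

```bash
pip install intel_extension_for_pytorch=={pytorch_version} -f https://developer.intel.com/ipex-whl-stable-cpu
```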
The following "Usage in Trainer" section takes mpirun in the Intel® MPI library as an example.
Usage in Trainer
To enable multi-CPU distributed training in the Trainer with the ccl backend, users should add --ddp_backend ccl to the command arguments.
Let's see an example with the question-answering example.
The following command enables training with 2 processes on one Xeon node, with one process running per socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
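A sketch of such a launch, using run_qa.py from the question-answering example; the model, dataset, and hyperparameter values below are illustrative assumptions:

```bash
export CCL_WORKER_COUNT=1
export MASTER_ADDR=127.0.0.1
# 2 processes on one node, one per socket
mpirun -n 2 -genv OMP_NUM_THREADS=23 \
 python3 run_qa.py \
 --model_name_or_path bert-large-uncased \
 --dataset_name squad \
 --do_train \
 --do_eval \
 --per_device_train_batch_size 12 \
 --learning_rate 3e-5 \
 --num_train_epochs 2 \
 --max_seq_length 384 \
 --doc_stride 128 \
 --output_dir /tmp/debug_squad/ \
 --no_cuda \
 --ddp_backend ccl \
 --use_ipex
```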
The following command enables training with a total of four processes on two Xeons (node0 and node1, taking node0 as the main process); ppn (processes per node) is set to 2, with one process running per socket. The variables OMP_NUM_THREADS/CCL_WORKER_COUNT can be tuned for optimal performance.
On node0, you need to create a configuration file that contains the IP addresses of each node (for example, hostfile) and pass that configuration file path as an argument.
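For example, a hostfile with one IP address per line (placeholder addresses shown):

```bash
cat hostfile
 xxx.xxx.xxx.xxx #node0 ip
 xxx.xxx.xxx.xxx #node1 ip
```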
Now, run the following command on node0 and 4DDP will be enabled on node0 and node1 with BF16 auto mixed precision:
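A sketch of the multi-node launch, again with illustrative hyperparameters and placeholder IP addresses:

```bash
export CCL_WORKER_COUNT=1
export MASTER_ADDR=xxx.xxx.xxx.xxx   # node0 ip
# 4 processes total, 2 per node, reading the node list from hostfile
mpirun -f hostfile -n 4 -ppn 2 \
 -genv OMP_NUM_THREADS=23 \
 python3 run_qa.py \
 --model_name_or_path bert-large-uncased \
 --dataset_name squad \
 --do_train \
 --do_eval \
 --per_device_train_batch_size 12 \
 --learning_rate 3e-5 \
 --num_train_epochs 2 \
 --max_seq_length 384 \
 --doc_stride 128 \
 --output_dir /tmp/debug_squad/ \
 --no_cuda \
 --ddp_backend ccl \
 --use_ipex \
 --bf16
```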