# Distributed Training

## Distributed training

When training on a single CPU is too slow, we can use multiple CPUs. This guide focuses on PyTorch-based DDP enabling distributed CPU training efficiently.

Distributed training on multiple CPUs is launched by mpirun which supports both Gloo and oneCCL as collective communication backends. And for performance seek, Intel recommends to use oneCCL backend.

[Intel® oneCCL](https://github.com/oneapi-src/oneCCL) (collective communications library) is a library for efficient distributed deep learning training implementing such collectives like allreduce, allgather, alltoall. For more information on oneCCL, please refer to the [oneCCL documentation](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html) and [oneCCL specification](https://spec.oneapi.com/versions/latest/elements/oneCCL/source/index.html).

Module `oneccl_bindings_for_pytorch` (`torch_ccl` before version 1.12) implements PyTorch C10D ProcessGroup API and can be dynamically loaded as external ProcessGroup and only works on Linux platform now

Check more detailed information for [oneccl\_bind\_pt](https://github.com/intel/torch-ccl).

We will show how to use oneCCL backend-ed distributed training as below steps.

Intel® oneCCL Bindings for PyTorch installation:

Wheel files are available for the following Python versions:

| Extension Version | Python 3.6 | Python 3.7 | Python 3.8 | Python 3.9 | Python 3.10 |
| :---------------: | :--------: | :--------: | :--------: | :--------: | :---------: |
|       1.12.1      |            |      √     |      √     |      √     |      √      |
|       1.12.0      |            |      √     |      √     |      √     |      √      |
|       1.11.0      |            |      √     |      √     |      √     |      √      |
|       1.10.0      |      √     |      √     |      √     |      √     |             |

Copied

```
pip install oneccl_bind_pt=={pytorch_version} -f https://software.intel.com/ipex-whl-stable
```

where `{pytorch_version}` should be your PyTorch version, for instance 1.12.0. Versions of oneCCL and PyTorch must match. `oneccl_bindings_for_pytorch` 1.12.0 prebuilt wheel does not work with PyTorch 1.12.1 (it is for PyTorch 1.12.0) PyTorch 1.12.1 should work with `oneccl_bindings_for_pytorch` 1.12.1

MPI tool set for Intel® oneCCL 1.12.0

Copied

```
oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
```

for Intel® oneCCL whose version < 1.12.0

Copied

```
torch_ccl_path=$(python -c "import torch; import torch_ccl; import os;  print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))")
source $torch_ccl_path/env/setvars.sh
```

The following command enables training with 2 processes on one node, with one process running per one socket. The variables OMP\_NUM\_THREADS/CCL\_WORKER\_COUNT can be tuned for optimal performance.

Copied

```
export CCL_WORKER_COUNT=1
export MASTER_ADDR=127.0.0.1
mpirun -n 2 -genv OMP_NUM_THREADS=23 \
python3 run_qa.py \
    --model_name_or_path distilbert-base-uncased-distilled-squad \
    --dataset_name squad \
    --apply_quantization \
    --quantization_approach static \
    --do_train \
    --do_eval \
    --verify_loading \
    --output_dir /tmp/squad_output \
    --no_cuda \
    --xpu_backend ccl
```

The following command enables training with a total of four processes on two nodes (node0 and node1, taking node0 as the main process), ppn (processes per node) is set to 2, with one process running per one socket. The variables OMP\_NUM\_THREADS/CCL\_WORKER\_COUNT can be tuned for optimal performance.

In node0, you need to create a configuration file which contains the IP addresses of each node (for example hostfile) and pass that configuration file path as an argument.

Copied

```
 cat hostfile
 xxx.xxx.xxx.xxx #node0 ip
 xxx.xxx.xxx.xxx #node1 ip
```

Now, run the following command in node0 and **4DDP** will be enabled in node0 and node1:

Copied

```
export CCL_WORKER_COUNT=1
export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
mpirun -f hostfile -n 4 -ppn 2 \
-genv OMP_NUM_THREADS=23 \
python3 run_qa.py \
    --model_name_or_path distilbert-base-uncased-distilled-squad \
    --dataset_name squad \
    --apply_quantization \
    --quantization_approach static \
    --do_train \
    --do_eval \
    --verify_loading \
    --output_dir /tmp/squad_output \
    --no_cuda \
    --xpu_backend ccl
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://boinc-ai.gitbook.io/optimum/intel/neural-compressor/distributed-training.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
