Multi-node Training
Multi-node training across several Gaudi servers is straightforward to set up. This guide shows how to:
set up several Gaudi instances
set up your computing environment
launch a multi-node run
Setting up several Gaudi instances
Two types of configurations are possible:
scale-out using Gaudi NICs or Host NICs (on-premises)
scale-out using AWS DL1 instances
On premises
To set up your servers on premises, check out the installation and distributed training pages of Habana Gaudi’s documentation.
AWS DL1 instances
Proceed with the following steps to correctly set up your DL1 instances.
1. Set up an EFA-enabled security group
To allow all instances to communicate with each other, you need to set up a security group as described by AWS in step 1 of this link. Once this is done, it should look as follows:
2. Launching instances
When you launch instances from the AWS EC2 console, you can choose the number of nodes to set up.
We recommend using the Habana Deep Learning Base AMI for your AWS DL1 instances. It is an EFA-enabled AMI so you do not need to install the EFA software (which may be necessary if you use a different AMI, installation instructions here).
Then, in the Network settings, select the security group you created in the previous step. You also have to select a specific subnet to unlock the Advanced network configuration in which you can enable the Elastic Fabric Adapter.
The last parameter to set is the Placement group in the Advanced details. You can create one if you do not have any. The placement strategy should be set to cluster.
Here is how it should look:
More information here.
Launching a Multi-node Run
Once your Gaudi instances are ready, you need to:
Enable password-less SSH on your instances so that they can communicate with each other. This explains how to do it.
On AWS, to train through EFA, hccl_ofi_wrapper should be installed. Here is how to do it.
On AWS, you need to set the following environment variables (the easiest way is to write a .deepspeed_env file as described here):
HCCL_OVER_OFI=1
LD_LIBRARY_PATH=path_to_hccl_ofi_wrapper:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib, where path_to_hccl_ofi_wrapper is the path to the hccl_ofi_wrapper folder which you installed in the previous step.
(optional) HCCL_SOCKET_IFNAME=my_network_interface. If not set, the first network interface with a name that does not start with lo or docker will be used. More information here.
To make this easier, we provide a Dockerfile here. You will just have to copy the public key of the leader node into the ~/.ssh/authorized_keys file of all other nodes to enable password-less SSH.
Then, you need to write a hostfile with the addresses and the numbers of devices of your nodes as follows:
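As a minimal sketch, assuming the DeepSpeed hostfile format (one `address slots=N` line per node, where N is the number of devices on that node), the file contents can be generated as below. The IP addresses and the 8 devices per node are placeholders, and `render_hostfile` is a hypothetical helper, not part of any library.

```python
def render_hostfile(nodes):
    """Return hostfile text with one '<address> slots=<num_devices>' line per node."""
    return "".join(f"{address} slots={num_devices}\n" for address, num_devices in nodes)


# Placeholder addresses; 8 devices per node is an assumption, adjust to your servers
nodes = [("10.0.0.1", 8), ("10.0.0.2", 8)]
hostfile_text = render_hostfile(nodes)
print(hostfile_text, end="")
```

Save the resulting text to a file on the leader node and pass its path when launching the run.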
Finally, there are two possible ways to run your training script on several nodes:
With the gaudi_spawn.py script, you can run the following command:
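The sketch below assembles such a command in Python and prints it as a shell line. It assumes that gaudi_spawn.py accepts `--hostfile` and `--use_deepspeed`, and that the training script takes a `--deepspeed` argument pointing to a DeepSpeed configuration file. All paths and the `--arg1`, `--arg2` arguments are placeholders, and `build_gaudi_spawn_command` is a hypothetical helper.

```python
import shlex


def build_gaudi_spawn_command(hostfile, script, script_args, deepspeed_config):
    """Assemble the gaudi_spawn.py command line as a list of tokens."""
    return [
        "python", "gaudi_spawn.py",
        "--hostfile", hostfile,
        "--use_deepspeed",
        script, *script_args,
        "--deepspeed", deepspeed_config,  # assumed to be an argument of the script
    ]


command = build_gaudi_spawn_command(
    hostfile="path_to_my_hostfile",
    script="path_to_my_script.py",
    script_args=["--arg1", "--arg2"],
    deepspeed_config="path_to_my_deepspeed_config",
)
print(shlex.join(command))
# To launch for real on the leader node: subprocess.run(command, check=True)
```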
where --argX is an argument of the script to run.
With the DistributedRunner, you can add this code snippet to a script:
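A minimal sketch of such a snippet follows. It assumes that `DistributedRunner` is importable from `optimum.habana.distributed`, accepts the keyword arguments `command_list`, `hostfile` and `use_deepspeed`, and exposes a `run()` method; please check these against your installed version. The paths and script arguments are placeholders, and `build_runner_kwargs` and `launch` are hypothetical helpers.

```python
def build_runner_kwargs(script, script_args, hostfile):
    """Collect the keyword arguments to pass to DistributedRunner (names are assumptions)."""
    return {
        # One command string: the script followed by its arguments
        "command_list": [" ".join([script, *script_args])],
        "hostfile": hostfile,
        "use_deepspeed": True,
    }


def launch(script, script_args, hostfile):
    """Start the multi-node run; requires optimum-habana on the leader node."""
    # Imported lazily so that the helper above can be inspected without optimum-habana
    from optimum.habana.distributed import DistributedRunner

    runner = DistributedRunner(**build_runner_kwargs(script, script_args, hostfile))
    return runner.run()


kwargs = build_runner_kwargs(
    "path_to_my_script.py", ["--arg1", "--arg2"], "path_to_my_hostfile"
)
print(kwargs)
```

Calling `launch(...)` with the same arguments would then start the run from the leader node.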
Environment Variables
If you need to set environment variables for all nodes, you can specify them in a .deepspeed_env file which should be located in the local path you are executing from or in your home directory. The format is the following:
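As a sketch, assuming the file holds one `NAME=value` pair per line, the AWS variables listed earlier in this guide can be rendered as follows. `path_to_hccl_ofi_wrapper` remains a placeholder for your own install folder, and `render_deepspeed_env` is a hypothetical helper.

```python
def render_deepspeed_env(variables):
    """Render environment variables as one NAME=value line each."""
    return "".join(f"{name}={value}\n" for name, value in variables.items())


aws_variables = {
    "HCCL_OVER_OFI": "1",
    # path_to_hccl_ofi_wrapper is a placeholder, replace it with your install folder
    "LD_LIBRARY_PATH": "path_to_hccl_ofi_wrapper:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib",
}
env_text = render_deepspeed_env(aws_variables)
print(env_text, end="")
```

Write the resulting text to a file named .deepspeed_env in the directory you launch from or in your home directory.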
You can find an example for AWS instances here.
Recommendations
It is strongly recommended to use gradient checkpointing for multi-node runs to get the highest speedups. You can enable it with --gradient_checkpointing in these examples or with gradient_checkpointing=True in your GaudiTrainingArguments.
Larger batch sizes should lead to higher speedups.
Multi-node inference is not recommended and can provide inconsistent results.
On AWS DL1 instances, run your Docker containers with the --privileged flag so that EFA devices are visible.
Example
In this example, we fine-tune a pre-trained GPT2-XL model on the WikiText dataset. We are going to use the causal language modeling example which is given in the GitHub repository.
The first step is to train the model on several nodes with this command:
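The sketch below shows what such a training command could look like, built as a token list and printed as a shell line. It is not the exact command of this guide: the output directory, the hostfile and DeepSpeed configuration paths, the `Habana/gpt2` Gaudi configuration and the `wikitext-2-raw-v1` dataset configuration are assumptions, and hyperparameters such as learning rate and batch size are left out on purpose.

```python
import shlex

train_command = [
    "python", "gaudi_spawn.py",  # adjust the path to where gaudi_spawn.py lives
    "--hostfile", "path_to_my_hostfile",
    "--use_deepspeed",
    "run_clm.py",
    "--model_name_or_path", "gpt2-xl",
    "--gaudi_config_name", "Habana/gpt2",  # assumed Gaudi configuration
    "--dataset_name", "wikitext",
    "--dataset_config_name", "wikitext-2-raw-v1",  # assumed dataset configuration
    "--do_train",
    "--output_dir", "/tmp/gpt2_xl_multi_node",  # placeholder output directory
    "--gradient_checkpointing",  # recommended for multi-node runs
    "--use_habana",
    "--use_lazy_mode",
    "--deepspeed", "path_to_my_deepspeed_config",
]
print(shlex.join(train_command))
```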
Evaluation is not performed in the same command because we do not recommend performing multi-node inference at the moment.
Once the model is trained, we can evaluate it with the following command. The argument --model_name_or_path should be equal to the argument --output_dir of the previous command.
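As a hedged sketch, a single-node evaluation command could be assembled as below. The point it illustrates is that --model_name_or_path reuses the directory that was given as --output_dir during training. The directory itself, the `Habana/gpt2` Gaudi configuration and the `wikitext-2-raw-v1` dataset configuration are assumptions, not values confirmed by this guide.

```python
import shlex

# Must match the --output_dir used for training (placeholder path)
trained_model_dir = "/tmp/gpt2_xl_multi_node"

eval_command = [
    "python", "run_clm.py",
    "--model_name_or_path", trained_model_dir,
    "--gaudi_config_name", "Habana/gpt2",  # assumed Gaudi configuration
    "--dataset_name", "wikitext",
    "--dataset_config_name", "wikitext-2-raw-v1",  # assumed dataset configuration
    "--do_eval",
    "--output_dir", trained_model_dir,
    "--use_habana",
    "--use_lazy_mode",
]
print(shlex.join(eval_command))
```

The command runs run_clm.py directly, without gaudi_spawn.py or a hostfile, since multi-node inference is not recommended.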