Multi-node Training
Multi-node training across several Gaudi servers is straightforward to set up. This guide shows how to:
set up several Gaudi instances
set up your computing environment
launch a multi-node run
Two types of configurations are possible:
scale-out using Gaudi NICs or Host NICs (on-premises)
scale-out using AWS DL1 instances
To set up your servers on premises, refer to Habana Gaudi’s documentation.
Proceed with the following steps to correctly set up your DL1 instances.
1. Set up an EFA-enabled security group
To allow all instances to communicate with each other, you need to set up a security group as described in step 1 of the AWS EFA setup instructions. Once this is done, it should look as follows:
2. Launch instances
When you launch instances from the AWS EC2 console, you can choose the number of nodes to set up.
Then, in the Network settings, select the security group you created in the previous step. You also have to select a specific subnet to unlock the Advanced network configuration in which you can enable the Elastic Fabric Adapter.
The last parameter to set is the Placement group in the Advanced details. You can create one if you do not have any. The placement strategy should be set to cluster.
Here is how it should look:
Once your Gaudi instances are ready, you need to:
HCCL_OVER_OFI=1
LD_LIBRARY_PATH=path_to_hccl_ofi_wrapper:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib
where path_to_hccl_ofi_wrapper is the path to the hccl_ofi_wrapper folder which you installed in the previous step.
Finally, there are two possible ways to run your training script on several nodes:
where --argX is an argument of the script to run.
Larger batch sizes should lead to higher speedups.
Multi-node inference is not recommended and can provide inconsistent results.
On AWS DL1 instances, run your Docker containers with the --privileged flag so that EFA devices are visible.
The first step is to train the model on several nodes.
Evaluation is not performed in the same command because we do not recommend performing multi-node inference at the moment.
Once the model is trained, we can evaluate it in a separate run. The argument --model_name_or_path should be equal to the argument --output_dir of the training command.
Security group for multi-node training on AWS DL1 instances
The AMI we recommend for your AWS DL1 instances is EFA-enabled, so you do not need to install the EFA software yourself. If you use a different AMI, you may have to install it.
Parameters for launching EFA-enabled AWS instances. The important parameters to set are circled in red. For the sake of clarity, not all parameters are represented.
Enable password-less SSH on your instances so that they can communicate with each other.
On AWS, to train through EFA, hccl_ofi_wrapper should be installed.
On AWS, you need to set the following environment variables (the easiest way is to write a .deepspeed_env file):
(optional) HCCL_SOCKET_IFNAME=my_network_interface. If not set, the first network interface with a name that does not start with lo or docker will be used.
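The default selection rule described above can be illustrated with a short Python sketch. The function name `pick_default_interface` is hypothetical, and the order in which HCCL actually enumerates interfaces is an assumption here: the sketch simply walks a list of names in the order given.

```python
import socket


def pick_default_interface(names):
    """Return the first name not starting with 'lo' or 'docker', or None."""
    for name in names:
        # str.startswith accepts a tuple of prefixes
        if not name.startswith(("lo", "docker")):
            return name
    return None


def list_interface_names():
    """List the interface names known to the OS (order is OS-dependent)."""
    return [name for _index, name in socket.if_nameindex()]


# Example with hypothetical interface names
chosen = pick_default_interface(["lo", "docker0", "eth0", "eth1"])
print(chosen)
```

Running `pick_default_interface(list_interface_names())` on a node gives a rough idea of which interface would be picked if HCCL_SOCKET_IFNAME is left unset, and therefore whether you need to set it explicitly.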
To make this easier, we provide a Dockerfile. You will just have to copy the public key of the leader node into the ~/.ssh/authorized_keys file of all other nodes to enable password-less SSH.
Then, you need to write a hostfile with the addresses and the numbers of devices of your nodes, one node per line.
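As an illustration, the sketch below generates such a file from Python, assuming the DeepSpeed hostfile format in which each line reads `address slots=N`. The addresses are hypothetical placeholders, and 8 devices per node is assumed (the number of Gaudi devices on a DL1 instance).

```python
def render_hostfile(nodes):
    """Build hostfile text from (address, number_of_devices) pairs."""
    lines = [f"{address} slots={slots}" for address, slots in nodes]
    return "\n".join(lines) + "\n"


# Hypothetical private IP addresses of two nodes with 8 devices each
nodes = [("10.0.0.1", 8), ("10.0.0.2", 8)]
hostfile_text = render_hostfile(nodes)
print(hostfile_text, end="")

# To save it, for example:
# from pathlib import Path
# Path("path_to_my_hostfile").write_text(hostfile_text)
```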
The first way is to launch the run from the command line with a launcher script.
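A minimal sketch of how such a command line could be assembled is shown below. It assumes that the launcher is the `gaudi_spawn.py` script from the Optimum Habana examples and that it accepts `--hostfile` and `--use_deepspeed`; the hostfile path, script path and script arguments are placeholders.

```python
import shlex


def build_launch_command(hostfile, script, script_args):
    """Assemble the argument list for a multi-node launch (assumed flags)."""
    return [
        "python", "gaudi_spawn.py",   # assumed launcher script
        "--hostfile", hostfile,       # file listing the nodes and their devices
        "--use_deepspeed",            # multi-node runs go through DeepSpeed
        script, *script_args,         # training script and its own arguments
    ]


cmd = build_launch_command(
    "path_to_my_hostfile", "path_to_my_script.py", ["--arg1", "--arg2"]
)
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run(cmd, check=True)` on the leader node without worrying about shell quoting.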
The second way is to launch the run from inside a Python script.
If you need to set environment variables for all nodes, you can specify them in a .deepspeed_env file, which should be located in the local path you are executing from or in your home directory. The file contains one NAME=value pair per line.
For AWS instances, this file would contain the HCCL_OVER_OFI and LD_LIBRARY_PATH variables listed above.
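As a sketch, the AWS variables from this guide can be rendered into the contents of a `.deepspeed_env` file as follows, assuming the one `NAME=value` pair per line format used by DeepSpeed. The helper name `render_deepspeed_env` is hypothetical, and `path_to_hccl_ofi_wrapper` remains a placeholder for your own installation path.

```python
def render_deepspeed_env(variables):
    """Render a mapping of environment variables as NAME=value lines."""
    return "".join(f"{name}={value}\n" for name, value in variables.items())


aws_env = {
    "HCCL_OVER_OFI": "1",
    # Replace path_to_hccl_ofi_wrapper with the actual folder on your nodes
    "LD_LIBRARY_PATH": (
        "path_to_hccl_ofi_wrapper:/opt/amazon/openmpi/lib:/opt/amazon/efa/lib"
    ),
}
env_text = render_deepspeed_env(aws_env)
print(env_text, end="")

# To save it in the home directory, for example:
# from pathlib import Path
# (Path.home() / ".deepspeed_env").write_text(env_text)
```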
It is strongly recommended to use gradient checkpointing for multi-node runs to get the highest speedups. You can enable it with --gradient_checkpointing when launching a script from the command line or with gradient_checkpointing=True in your GaudiTrainingArguments.
In this example, we fine-tune a pre-trained GPT2-XL model.