Comparing performance across distributed setups
Comparing performance between different device setups
Evaluating and comparing the performance of different setups can be quite tricky if you don't know what to look for. For example, you cannot run the same script with the same batch size across TPU, multi-GPU, and single-GPU with Accelerate and expect your results to line up.
But why?
This tutorial covers three reasons:
Setting the right seeds
Observed Batch Sizes
Learning Rates
Setting the Seed
While this issue comes up less often than the others, make sure to use utils.set_seed() to fully set the seed in all distributed cases so that training is reproducible.
Why is this important? Under the hood, this sets five different seeds: Python's random state, NumPy's state, torch's state, torch's CUDA state, and, if TPUs are available, torch_xla's random state.
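The seed settings listed above can be sketched roughly as follows. This is a minimal illustration, not Accelerate's actual implementation: `seed_everything_sketch` is a hypothetical name, NumPy and torch are treated as optional so the snippet runs without them, and the torch_xla step is left out because it needs a TPU runtime.

```python
import random


def seed_everything_sketch(seed: int) -> None:
    """Hypothetical helper that seeds the common random number generators."""
    # 1. Python's built-in random state
    random.seed(seed)

    # 2. NumPy's global random state (skipped if NumPy is not installed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass

    # 3. and 4. torch's CPU state and the state of every CUDA device
    try:
        import torch
        torch.manual_seed(seed)
        # Safe to call even when CUDA is not available
        torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass

    # 5. The torch_xla state is intentionally omitted from this sketch.


# Seeding twice with the same value reproduces the same draw.
seed_everything_sketch(42)
first_draw = random.random()
seed_everything_sketch(42)
second_draw = random.random()
print(first_draw == second_draw)
```

In a real training script, a single call to `set_seed(42)`, imported from `accelerate.utils`, replaces all of this.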
Observed Batch Sizes
When training with Accelerate, the batch size passed to the dataloader is the batch size per GPU. This means a batch size of 64 on two GPUs is effectively an observed batch size of 128. Account for this when comparing against a single GPU, and likewise for TPUs.
The below table can be used as a quick reference to try out different batch sizes:
In this example, there are two GPUs for "Multi-GPU" and a TPU pod with 8 workers.
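The equivalent batch sizes follow from simple arithmetic: divide the batch size you want to observe by the number of processes. The sketch below assumes the setups named above (1 GPU, 2 GPUs, and 8 TPU workers); the function names and the target batch sizes are illustrative choices, not part of Accelerate.

```python
def per_device_batch_size(target_observed: int, num_processes: int) -> int:
    """Batch size to pass to the dataloader so the observed size matches the target."""
    if target_observed % num_processes != 0:
        raise ValueError("target batch size must be divisible by num_processes")
    return target_observed // num_processes


def observed_batch_size(per_device: int, num_processes: int) -> int:
    """Total number of samples consumed per step across all processes."""
    return per_device * num_processes


# Number of processes for each setup described in the text
setups = {"Single GPU": 1, "Multi-GPU": 2, "TPU": 8}

# Illustrative target batch sizes (assumed values)
for target in (256, 128, 64, 32):
    row = {name: per_device_batch_size(target, n) for name, n in setups.items()}
    print(f"observed={target}: {row}")
```

To compare a two-GPU run against a single-GPU run, pass `per_device_batch_size(target, 2)` to the dataloader in the first case and the full `target` in the second.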
Learning Rates
As noted in multiple sources[1][2], the learning rate should be scaled linearly with the number of devices present. With Accelerate, this means multiplying the base learning rate by the number of processes (accelerator.num_processes).
Since users can have their own learning rate schedulers defined, we leave this up to the user to decide if they wish to scale their learning rate or not.
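A minimal sketch of linear learning rate scaling is shown below. The helper name `scale_learning_rate` and the base value of 1e-3 are assumptions made for illustration. The process counts are hard-coded so the snippet runs anywhere; in an Accelerate script the count would come from `accelerator.num_processes`.

```python
def scale_learning_rate(base_lr: float, num_processes: int) -> float:
    """Linearly scale a base learning rate by the number of processes."""
    return base_lr * num_processes


# Assumed base learning rate, tuned for a single device
base_lr = 1e-3

# In a real script: num_processes = Accelerator().num_processes
for num_processes in (1, 2, 8):
    scaled = scale_learning_rate(base_lr, num_processes)
    print(f"{num_processes} process(es): lr={scaled}")
```

The scaled value is then passed to the optimizer in place of the base learning rate.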
You will also find that Accelerate steps the learning rate schedule based on the number of processes being trained on. This is because of the observed batch size noted earlier. With 2 GPUs, the schedule is stepped twice as often as on a single GPU to account for the batch size being twice as large (assuming the batch size on the single-GPU run is left unchanged).
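The effect of this stepping can be simulated in plain Python. The sketch below is not Accelerate's scheduler wrapper: it assumes a simple linear-decay schedule and advances it once per process for every optimizer step, which is the behavior described above. All names and values are illustrative.

```python
def linear_decay_lr(base_lr: float, scheduler_step: int, total_steps: int) -> float:
    """Learning rate under a linear decay from base_lr down to zero."""
    remaining = max(0, total_steps - scheduler_step)
    return base_lr * remaining / total_steps


def lr_after(optimizer_steps: int, num_processes: int,
             base_lr: float = 1e-3, total_steps: int = 100) -> float:
    """Learning rate reached after a number of optimizer steps."""
    scheduler_step = 0
    for _ in range(optimizer_steps):
        # The schedule advances once per process for each optimizer step
        for _ in range(num_processes):
            scheduler_step += 1
    return linear_decay_lr(base_lr, scheduler_step, total_steps)


single_gpu_lr = lr_after(optimizer_steps=10, num_processes=1)
two_gpu_lr = lr_after(optimizer_steps=10, num_processes=2)
print(single_gpu_lr, two_gpu_lr)
```

After the same number of optimizer steps, the two-GPU run has moved twice as far along the schedule, so its learning rate is lower than the single-GPU one.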
Gradient Accumulation and Mixed Precision
When using gradient accumulation and mixed precision, due to how gradient averaging works (accumulation) and the precision loss (mixed precision), some degradation in performance is expected. This will be explicitly seen when comparing the batch-wise loss between different compute setups. However, the overall loss, metric, and general performance at the end of training should be roughly the same.
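One way to see why batch-wise losses differ while the overall result stays close is to compare a full batch with the same samples split into accumulation steps. The per-sample losses below are made-up numbers for illustration only, and the sketch covers the averaging side of gradient accumulation, not mixed precision.

```python
from statistics import mean

# Made-up per-sample losses for one batch of 8 samples
per_sample_losses = [0.9, 0.7, 0.8, 0.6, 0.3, 0.5, 0.2, 0.4]

# Loss reported when the whole batch is processed at once
full_batch_loss = mean(per_sample_losses)

# Same samples split into two equally sized accumulation steps
micro_batches = [per_sample_losses[:4], per_sample_losses[4:]]
micro_batch_losses = [mean(batch) for batch in micro_batches]

# Averaging the micro-batch losses recovers the full-batch loss
accumulated_loss = mean(micro_batch_losses)

print("per-step losses:", micro_batch_losses)
print("full batch:", full_batch_loss, "accumulated:", accumulated_loss)
```

Each accumulation step reports a different loss than the full batch would, yet their average matches it up to floating-point error. This is consistent with batch-wise losses diverging between setups while the end result stays roughly the same.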