Comparing performance across distributed setups

Comparing performance between different device setups

Evaluating and comparing the performance from different setups can be quite tricky if you don’t know what to look for. For example, you cannot run the same script with the same batch size across TPU, multi-GPU, and single-GPU with Accelerate and expect your results to line up.

But why?

There are three reasons for this that this tutorial will cover:

Setting the right seeds
Observed Batch Sizes
Learning Rates

Setting the Seed

While this issue has not come up as much, make sure to use utils.set_seed() to fully set the seed in all distributed cases so training will be reproducible:

Copied

from accelerate.utils import set_seed

set_seed(42)

Why is this important? Under the hood this will set 5 different seed settings:

Copied

    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # ^^ safe to call this function even if cuda is not available
    if is_tpu_available():
        xm.set_rng_state(seed)

The random state, numpy’s state, torch, torch’s cuda state, and if TPUs are available torch_xla’s cuda state.

Observed Batch Sizes

When training with Accelerate, the batch size passed to the dataloader is the batch size per GPU. What this entails is a batch size of 64 on two GPUs is truly a batch size of 128. As a result, when testing on a single GPU this needs to be accounted for, as well as similarly for TPUs.

The below table can be used as a quick reference to try out different batch sizes:

In this example, there are two GPUs for “Multi-GPU” and a TPU pod with 8 workers

Single GPU Batch Size

Multi-GPU Equivalent Batch Size

TPU Equivalent Batch Size

256

128

Learning Rates

As noted in multiple sources[1][2], the learning rate should be scaled linearly based on the number of devices present. The below snippet shows doing so with Accelerate:

Since users can have their own learning rate schedulers defined, we leave this up to the user to decide if they wish to scale their learning rate or not.

Copied

learning_rate = 1e-3
accelerator = Accelerator()
learning_rate *= accelerator.num_processes

optimizer = AdamW(params=model.parameters(), lr=learning_rate)

You will also find that accelerate will step the learning rate based on the number of processes being trained on. This is because of the observed batch size noted earlier. So in the case of 2 GPUs, the learning rate will be stepped twice as often as a single GPU to account for the batch size being twice as large (if no changes to the batch size on the single GPU instance are made).

Gradient Accumulation and Mixed Precision

When using gradient accumulation and mixed precision, due to how gradient averaging works (accumulation) and the precision loss (mixed precision), some degradation in performance is expected. This will be explicitly seen when comparing the batch-wise loss between different compute setups. However, the overall loss, metric, and general performance at the end of training should be roughly the same.

PreviousLoading big models into memory NextExecuting and deferring jobs

Last updated 1 year ago