Using Local SGD with 🤗 Accelerate

Local SGD is a technique for distributed training where gradients are not synchronized every step. Instead, each process updates its own copy of the model weights, and after a given number of steps these weights are synchronized by averaging them across all processes. This improves communication efficiency and can lead to a substantial training speedup, especially when the machine lacks a fast interconnect such as NVLink. Unlike gradient accumulation (where improving communication efficiency requires increasing the effective batch size), Local SGD does not require changing the batch size or the learning rate / schedule. However, if necessary, Local SGD can be combined with gradient accumulation as well.

In this tutorial you will see how to quickly set up Local SGD with 🤗 Accelerate. Compared to a standard Accelerate setup, this requires only two extra lines of code.

This example will use a very simplistic PyTorch training loop that performs gradient accumulation every two batches:


device = "cuda"
model.to(device)

gradient_accumulation_steps = 2

for index, batch in enumerate(training_dataloader):
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    # Scale the loss so the accumulated gradients match one larger batch.
    loss = loss / gradient_accumulation_steps
    loss.backward()
    # Only step the optimizer every `gradient_accumulation_steps` batches.
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

Converting it to 🤗 Accelerate

First, the code shown earlier will be converted to use 🤗 Accelerate with neither the LocalSGD helper nor the gradient accumulation helper:

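A minimal sketch of such a conversion, assuming the same model, optimizer, scheduler, training_dataloader, and loss_function objects as in the plain PyTorch loop above:

from accelerate import Accelerator

accelerator = Accelerator()

# Accelerate moves everything to the right device(s), so the manual
# .to(device) calls go away.
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

gradient_accumulation_steps = 2

for index, batch in enumerate(training_dataloader):
    inputs, targets = batch
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss = loss / gradient_accumulation_steps
    accelerator.backward(loss)
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()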

Letting 🤗 Accelerate handle model synchronization

All that is left now is to let 🤗 Accelerate handle the model parameter synchronization and the gradient accumulation for us. For simplicity, let us assume we need to synchronize every 8 steps. This is achieved by adding one with LocalSGD statement and one call to local_sgd.step() after every optimizer step:

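A minimal sketch of the resulting loop, assuming the LocalSGD helper from accelerate.local_sgd and an Accelerator configured to accumulate gradients every two batches:

from accelerate import Accelerator
from accelerate.local_sgd import LocalSGD

# Let Accelerate handle gradient accumulation (every 2 batches) for us.
accelerator = Accelerator(gradient_accumulation_steps=2)

model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)

# Average model parameters across processes every 8 optimizer steps.
with LocalSGD(accelerator=accelerator, model=model, local_sgd_steps=8, enabled=True) as local_sgd:
    for batch in training_dataloader:
        with accelerator.accumulate(model):
            inputs, targets = batch
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            # Count the step; parameters are averaged every local_sgd_steps steps.
            local_sgd.step()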

Under the hood, the Local SGD code disables automatic gradient synchronization (but accumulation still works as expected!). Instead it averages model parameters every local_sgd_steps steps (as well as at the end of the training loop).
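For intuition only, a rough sketch of what such periodic parameter averaging can look like with plain torch.distributed (an illustrative helper, not Accelerate's actual implementation):

import torch
import torch.distributed as dist

def average_model_parameters(model: torch.nn.Module):
    # Illustration: sum each parameter across all processes, then divide by the
    # world size so every process ends up holding the average of the local copies.
    world_size = dist.get_world_size()
    with torch.no_grad():
        for param in model.parameters():
            dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
            param.data /= world_size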

Limitations

The current implementation works only with basic multi-GPU (or multi-CPU) training and does not support, e.g., DeepSpeed.

References

Although we are not aware of the true origins of this simple approach, the idea of local SGD is quite old and goes back to at least:

Zhang, J., De Sa, C., Mitliagkas, I., & Ré, C. (2016). Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365.

We credit the term Local SGD to the following paper (but there might be earlier references we are not aware of).

Stich, Sebastian Urban. "Local SGD Converges Fast and Communicates Little." ICLR 2019 - International Conference on Learning Representations, 2019.
