Accelerating training with local SGD

Using Local SGD with 🌍 Accelerate

Local SGD is a technique for distributed training where gradients are not synchronized every step. Instead, each process updates its own version of the model weights, and after a given number of steps these weights are synchronized by averaging them across all processes. This improves communication efficiency and can lead to a substantial training speed-up, especially when a computer lacks a fast interconnect such as NVLink. Unlike gradient accumulation (where improving communication efficiency requires increasing the effective batch size), Local SGD does not require changing the batch size or the learning rate / schedule. However, if necessary, Local SGD can be combined with gradient accumulation as well.
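
To make this concrete, below is a minimal sketch of the communication pattern in plain PyTorch (an illustration only, not part of any library API; it assumes a torch.distributed process group is already initialized, that every process starts from the same initial weights, and that model, optimizer, training_dataloader and loss_function are defined as in the rest of this tutorial; device placement is omitted for brevity):

import torch.distributed as dist

local_sgd_steps = 8  # synchronize the weights every 8 optimizer steps

for step, batch in enumerate(training_dataloader):
    inputs, targets = batch
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss.backward()
    optimizer.step()       # purely local update, no communication here
    optimizer.zero_grad()
    if (step + 1) % local_sgd_steps == 0:
        # The only communication: average the model weights across all processes.
        for param in model.parameters():
            dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
            param.data /= dist.get_world_size()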

In this tutorial you will see how to quickly set up Local SGD with 🌍 Accelerate. Compared to a standard Accelerate setup, this requires only two extra lines of code.

This example will use a very simple PyTorch training loop that performs gradient accumulation every two batches:


device = "cuda"
model.to(device)

gradient_accumulation_steps = 2

for index, batch in enumerate(training_dataloader):
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss = loss / gradient_accumulation_steps
    loss.backward()
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

Converting it to 🌍 Accelerate

First, the code shown earlier will be converted to use 🌍 Accelerate with neither a LocalSGD nor a gradient accumulation helper:


+ from accelerate import Accelerator
+ accelerator = Accelerator()

+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+     model, optimizer, training_dataloader, scheduler
+ )

  for index, batch in enumerate(training_dataloader):
      inputs, targets = batch
-     inputs = inputs.to(device)
-     targets = targets.to(device)
      outputs = model(inputs)
      loss = loss_function(outputs, targets)
      loss = loss / gradient_accumulation_steps
-     loss.backward()
+     accelerator.backward(loss)
      if (index + 1) % gradient_accumulation_steps == 0:
          optimizer.step()
          scheduler.step()
          optimizer.zero_grad()

Letting 🌍 Accelerate handle model synchronization

All that is left now is to let 🌍 Accelerate handle the model parameter synchronization and the gradient accumulation for us. For simplicity, let us assume we need to synchronize every 8 steps. This is achieved by adding one with LocalSGD statement and one call to local_sgd.step() after every optimizer step:


+ from accelerate import LocalSGD

+ local_sgd_steps = 8

+ with LocalSGD(accelerator=accelerator, model=model, local_sgd_steps=local_sgd_steps, enabled=True) as local_sgd:
    for batch in training_dataloader:
        with accelerator.accumulate(model):
            inputs, targets = batch
            outputs = model(inputs)
            loss = loss_function(outputs, targets)
            accelerator.backward(loss)
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
+           local_sgd.step()

Under the hood, the Local SGD code disables automatic gradient synchronization (but accumulation still works as expected!). Instead, it averages model parameters every local_sgd_steps steps (as well as at the end of the training loop).
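
If it helps to see the shape of that behavior, the following is a rough sketch of what this amounts to conceptually (an illustration only, not the library's actual implementation): the backward passes run under accelerator.no_sync(model) so that no gradient all-reduce happens, and the parameters themselves are averaged once every local_sgd_steps optimizer steps.

steps_done = 0
for batch in training_dataloader:
    with accelerator.no_sync(model):   # skip the per-step gradient all-reduce
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
    steps_done += 1
    if steps_done % local_sgd_steps == 0:
        # Periodic synchronization: average the model parameters across processes.
        for param in model.parameters():
            param.data = accelerator.reduce(param.data, reduction="mean")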

Limitations

The current implementation works only with basic multi-GPU (or multi-CPU) training, without, e.g., DeepSpeed.

References

Although we are not aware of the true origins of this simple approach, the idea of local SGD is quite old and goes back to at least:

Zhang, J., De Sa, C., Mitliagkas, I., & Ré, C. (2016). Parallel SGD: When does averaging help? arXiv preprint arXiv:1606.07365.

We credit the term Local SGD to the following paper (but there might be earlier references we are not aware of):

Stich, Sebastian Urban. "Local SGD Converges Fast and Communicates Little." ICLR 2019 - International Conference on Learning Representations, 2019.
