Gradient Synchronization

PyTorch’s distributed module operates by communicating back and forth between all of the GPUs in your system. This communication takes time, and ensuring that all processes know each other's states happens at particular trigger points when using the ddp module.

These trigger points are added to the PyTorch model, specifically its forward() and backward() methods. This happens when the model is wrapped with DistributedDataParallel:


import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

model = nn.Linear(10, 10)
ddp_model = DistributedDataParallel(model)

In 🌍 Accelerate, this conversion happens automatically when calling prepare() and passing in your model.


+ from accelerate import Accelerator
+ accelerator = Accelerator()
  import torch.nn as nn
- from torch.nn.parallel import DistributedDataParallel

  model = nn.Linear(10, 10)
+ model = accelerator.prepare(model)

The slowdown in gradient accumulation

You now understand that PyTorch adds hooks to the forward and backward methods of your PyTorch model when training in a distributed setup. But how does this risk slowing down your code?

In DDP (distributed data parallel), processes are expected to run specific operations in a specific order at specific points, and these operations must also occur at roughly the same time before moving on.

The most direct example is when you update model parameters through optimizer.step(). Without gradient accumulation, all instances of the model need to have their gradients computed, collated, and updated before moving on to the next batch of data. When performing gradient accumulation, you accumulate n loss gradients and skip optimizer.step() until n batches have been reached. Since all training processes only need to synchronize by the time optimizer.step() is called, without any modification to your training step this needless inter-process communication can cause a significant slowdown.
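
For concreteness, here is a minimal sketch of the naive pattern described above (assuming ddp_model, dataloader, optimizer, and loss_func are defined as in the surrounding examples): every call to backward() still triggers inter-process gradient synchronization, even though only every fourth batch actually performs an optimizer step.

accumulation_steps = 4

for index, batch in enumerate(dataloader):
    inputs, targets = batch
    outputs = ddp_model(inputs)
    loss = loss_func(outputs, targets) / accumulation_steps
    # Without no_sync, this backward() call communicates gradients
    # across all processes on every single batch.
    loss.backward()
    if (index + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()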

How can you avoid this overhead?

Solving the slowdown problem

Since you are skipping model parameter updates when training on these batches, their gradients do not need to be synchronized until the point where optimizer.step() is actually called. PyTorch cannot automagically tell when you need to do this, but it does provide a tool to help: the no_sync context manager that is added to your model after converting it to DDP.

Under this context manager, PyTorch will skip synchronizing the gradients when .backward() is called, and the first call to .backward() outside this context manager will trigger the synchronization. See an example below:


ddp_model, dataloader, optimizer = accelerator.prepare(model, dataloader, optimizer)

for index, batch in enumerate(dataloader):
    inputs, targets = batch
    # Trigger gradient synchronization on the last batch
    if index != (len(dataloader) - 1):
        with ddp_model.no_sync():
            # Gradients only accumulate
            outputs = ddp_model(inputs)
            loss = loss_func(outputs, targets)
            accelerator.backward(loss)
    else:
        # Gradients finally sync
        outputs = ddp_model(inputs)
        loss = loss_func(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

In 🌍 Accelerate, to make this an API that can be called no matter the training device (though it may not do anything if you are not in a distributed system!), ddp_model.no_sync gets replaced with no_sync() and operates the same way:

  ddp_model, dataloader, optimizer = accelerator.prepare(model, dataloader, optimizer)

  for index, batch in enumerate(dataloader):
      inputs, targets = batch
      # Trigger gradient synchronization on the last batch
      if index != (len(dataloader)-1):
-         with ddp_model.no_sync():
+         with accelerator.no_sync(model):
              # Gradients only accumulate
              outputs = ddp_model(inputs)
              loss = loss_func(outputs, targets)
              accelerator.backward(loss)
      else:
          # Gradients finally sync
          outputs = ddp_model(inputs)
          loss = loss_func(outputs, targets)
          accelerator.backward(loss)
          optimizer.step()
          optimizer.zero_grad()

As you may expect, the accumulate() function wraps around this conditional check by keeping track of the current batch number, leaving you with the final gradient accumulation API:

ddp_model, dataloader, optimizer = accelerator.prepare(model, dataloader, optimizer)

for batch in dataloader:
    with accelerator.accumulate(ddp_model):
        inputs, targets = batch
        outputs = ddp_model(inputs)
        loss = loss_func(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()

As a result, you should either use accelerator.accumulate or accelerator.no_sync when it comes to API choice.
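
Whichever API you choose, Accelerate also exposes accelerator.sync_gradients, which is True only on the batches where gradients are actually synchronized across processes. A short sketch, reusing the names from the examples above and adding a hypothetical gradient-clipping step, of how you might act only on real update steps:

for batch in dataloader:
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_func(outputs, targets)
        accelerator.backward(loss)
        # sync_gradients is True only on batches where gradients were
        # actually synchronized across processes (i.e. real update steps).
        if accelerator.sync_gradients:
            accelerator.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        optimizer.zero_grad()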

Just how much of a slowdown is there, and easy mistakes you can make

To set up a realistic example, consider the following setup:

  • Two single-GPU T4 nodes and one node with two GPUs

  • Each GPU is a T4 and is hosted on GCP

  • Batch size per GPU is 16, and gradients are accumulated every 4 steps

The script used is a modification of the NLP Example script, and all of the scripts are available in this repository.

If you are not careful about gradient synchronization and GPU communication, a large amount of time can be wasted when these GPUs communicate with each other during unnecessary periods.

By how much?

Reference:

  • Baseline: uses no synchronization practices discussed here

  • no_sync improperly: no_sync only around the backward call, not the forward

  • no_sync: using the no_sync pattern properly

  • accumulate: using accumulate() properly

Below are the average seconds per batch iterating over 29 batches of data for each setup on both a single node and on the dual-node setup:

               Baseline      no_sync improperly   no_sync       accumulate
Multi-Node     2±0.01s       2.13±0.08s           0.91±0.11s    0.91±0.11s
Single Node    0.50±0.01s    0.50±0.01s           0.41±0.015s   0.41±0.015s

As you can see, if you are not careful about how you set up your gradient synchronization, you can incur more than a 2x slowdown during training!

If you are worried about making sure everything is done properly, we highly recommend utilizing the accumulate() function and passing gradient_accumulation_steps or gradient_accumulation_plugin to the Accelerator object so Accelerate can handle this for you.
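
For reference, a minimal sketch of that recommended setup (assuming model, optimizer, dataloader, and loss_func are defined as in the earlier examples):

from accelerate import Accelerator

# Accumulate gradients over 4 batches; accumulate() will then skip gradient
# synchronization (and the wrapped optimizer's step) on intermediate batches.
accelerator = Accelerator(gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_func(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()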
