
Performing gradient accumulation

Performing gradient accumulation with Accelerate

Gradient accumulation is a technique that lets you train with a larger effective batch size than your machine can normally fit into memory. Gradients are accumulated over several batches, and the optimizer only steps after a set number of batches have been processed.
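
To make the idea concrete, here is a minimal pure-Python sketch (no PyTorch) using a one-parameter linear model and a mean-squared-error loss. The helper name mse_grad and the toy data are illustrative assumptions. Under the assumption that all micro-batches have the same size, it suggests that summing the micro-batch gradients, each divided by the number of accumulation steps, reproduces the full-batch gradient.

```python
# Toy check: accumulated micro-batch gradients match the full-batch gradient.
# Model: prediction = w * x, loss = mean((w * x - y) ** 2).

def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.0
accumulation_steps = 4
micro_batch_size = len(xs) // accumulation_steps

# Gradient computed on the whole batch at once.
full_grad = mse_grad(w, xs, ys)

# Gradient accumulated over equally sized micro-batches.
accumulated_grad = 0.0
for start in range(0, len(xs), micro_batch_size):
    batch_x = xs[start:start + micro_batch_size]
    batch_y = ys[start:start + micro_batch_size]
    # Dividing by accumulation_steps mirrors scaling the loss before backward().
    accumulated_grad += mse_grad(w, batch_x, batch_y) / accumulation_steps

print(full_grad, accumulated_grad)
```

Both values agree, which is why the optimizer step taken after accumulating behaves like a step on the larger batch.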

Standard gradient accumulation code technically works in a distributed setup, but it is not the most efficient way to do it, and you may experience considerable slowdowns!

In this tutorial you will see how to quickly set up gradient accumulation and perform it with the utilities provided in Accelerate, which can amount to adding just one new line of code!

This example will use a very simplistic PyTorch training loop that performs gradient accumulation every two batches:


device = "cuda"
model.to(device)

gradient_accumulation_steps = 2

for index, batch in enumerate(training_dataloader):
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss = loss / gradient_accumulation_steps
    loss.backward()
    if (index + 1) % gradient_accumulation_steps == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
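
The modulo check in the loop above decides which batches trigger an optimizer step. The short sketch below (pure Python, with a hypothetical helper named step_indices) lists those batch indices. It also points at one thing to watch for in this hand-written pattern: if the number of batches is not a multiple of gradient_accumulation_steps, the trailing batches contribute gradients but no optimizer step follows them within the loop.

```python
def step_indices(num_batches, gradient_accumulation_steps):
    """Return the batch indices at which the manual loop calls optimizer.step()."""
    return [
        index
        for index in range(num_batches)
        if (index + 1) % gradient_accumulation_steps == 0
    ]

# 8 batches, stepping every 2: every second batch triggers a step.
print(step_indices(8, 2))  # [1, 3, 5, 7]

# 7 batches, stepping every 2: batch 6 adds gradients but never triggers a step.
print(step_indices(7, 2))  # [1, 3, 5]
```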

Converting it to Accelerate

First, the code shown earlier will be converted to use Accelerate, without the special gradient accumulation helper:

+ from accelerate import Accelerator
+ accelerator = Accelerator()

+ model, optimizer, training_dataloader, scheduler = accelerator.prepare(
+     model, optimizer, training_dataloader, scheduler
+ )

  for index, batch in enumerate(training_dataloader):
      inputs, targets = batch
-     inputs = inputs.to(device)
-     targets = targets.to(device)
      outputs = model(inputs)
      loss = loss_function(outputs, targets)
      loss = loss / gradient_accumulation_steps
-     loss.backward()
+     accelerator.backward(loss)
      if (index+1) % gradient_accumulation_steps == 0:
          optimizer.step()
          scheduler.step()
          optimizer.zero_grad()

Letting Accelerate handle gradient accumulation

  from accelerate import Accelerator
- accelerator = Accelerator()
+ accelerator = Accelerator(gradient_accumulation_steps=2)


- for index, batch in enumerate(training_dataloader):
+ for batch in training_dataloader:
+     with accelerator.accumulate(model):
          inputs, targets = batch
          outputs = model(inputs)

You can remove all the special checks for the step number and the loss adjustment:


- loss = loss / gradient_accumulation_steps
  accelerator.backward(loss)
- if (index+1) % gradient_accumulation_steps == 0:
  optimizer.step()
  scheduler.step()
  optimizer.zero_grad()

Typically with gradient accumulation, you would need to adjust the number of steps to reflect the change in the total number of batches you are training on. Accelerate does this for you automatically by default. Behind the scenes, a GradientAccumulationPlugin configured to do this is instantiated.
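
To build intuition for the bookkeeping that the context manager and the prepared optimizer take off your hands, here is a toy pure-Python sketch. It is not Accelerate's implementation: the class ToyAccumulator and all of its attributes are hypothetical names invented for illustration. The sketch counts batches, scales each gradient contribution, and only applies an update once every few batches, which is what the manual loop did with the modulo check.

```python
from contextlib import contextmanager


class ToyAccumulator:
    """Hypothetical stand-in that mimics the bookkeeping, not the real library."""

    def __init__(self, gradient_accumulation_steps):
        self.steps = gradient_accumulation_steps
        self.batch_count = 0
        self.sync_gradients = False
        self.grad = 0.0    # accumulated gradient of a single scalar weight
        self.weight = 0.0
        self.updates = 0

    @contextmanager
    def accumulate(self):
        # On entry, decide whether this batch completes an accumulation window.
        self.batch_count += 1
        self.sync_gradients = self.batch_count % self.steps == 0
        yield

    def backward(self, batch_grad):
        # Scale each contribution, like dividing the loss by the step count.
        self.grad += batch_grad / self.steps

    def optimizer_step(self, lr):
        # Skip the update unless the window is complete.
        if self.sync_gradients:
            self.weight -= lr * self.grad
            self.updates += 1

    def zero_grad(self):
        # Only clear the gradient after a real update.
        if self.sync_gradients:
            self.grad = 0.0


toy = ToyAccumulator(gradient_accumulation_steps=2)
for batch_grad in [1.0, 3.0, 5.0, 7.0]:
    with toy.accumulate():
        toy.backward(batch_grad)
        toy.optimizer_step(lr=0.1)
        toy.zero_grad()

print(toy.updates, toy.weight)
```

Although optimizer_step and zero_grad are called on every batch, only two updates happen over the four batches, so the loop body no longer needs its own step counter.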

The finished code

Below is the finished implementation for performing gradient accumulation with Accelerate:

from accelerate import Accelerator
accelerator = Accelerator(gradient_accumulation_steps=2)
model, optimizer, training_dataloader, scheduler = accelerator.prepare(
    model, optimizer, training_dataloader, scheduler
)
for batch in training_dataloader:
    with accelerator.accumulate(model):
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()

It's important that only one forward/backward pass is done inside the context manager with accelerator.accumulate(model).

Self-contained example

Here is a self-contained example that you can run to see gradient accumulation in action with Accelerate:

import torch
import copy
from accelerate import Accelerator
from accelerate.utils import set_seed
from torch.utils.data import TensorDataset, DataLoader

# seed
set_seed(0)

# define toy inputs and labels
x = torch.tensor([1., 2., 3., 4., 5., 6., 7., 8.])
y = torch.tensor([2., 4., 6., 8., 10., 12., 14., 16.])
gradient_accumulation_steps = 4
batch_size = len(x) // gradient_accumulation_steps

# define dataset and dataloader
dataset = TensorDataset(x, y)
dataloader = DataLoader(dataset, batch_size=batch_size)

# define model, optimizer and loss function
model = torch.zeros((1, 1), requires_grad=True)
model_clone = copy.deepcopy(model)
criterion = torch.nn.MSELoss()
model_optimizer = torch.optim.SGD([model], lr=0.02)
accelerator = Accelerator(gradient_accumulation_steps=gradient_accumulation_steps)
model, model_optimizer, dataloader = accelerator.prepare(model, model_optimizer, dataloader)
model_clone_optimizer = torch.optim.SGD([model_clone], lr=0.02)
print(f"initial model weight is {model.mean().item():.5f}")
print(f"initial model weight is {model_clone.mean().item():.5f}")
for i, (inputs, labels) in enumerate(dataloader):
    with accelerator.accumulate(model):
        inputs = inputs.view(-1, 1)
        print(i, inputs.flatten())
        labels = labels.view(-1, 1)
        outputs = inputs @ model
        loss = criterion(outputs, labels)
        accelerator.backward(loss)
        model_optimizer.step()
        model_optimizer.zero_grad()
loss = criterion(x.view(-1, 1) @ model_clone, y.view(-1, 1))
model_clone_optimizer.zero_grad()
loss.backward()
model_clone_optimizer.step()
print(f"w/ accumulation, the final model weight is {model.mean().item():.5f}")
print(f"w/o accumulation, the final model weight is {model_clone.mean().item():.5f}")

Output:

initial model weight is 0.00000
initial model weight is 0.00000
0 tensor([1., 2.])
1 tensor([3., 4.])
2 tensor([5., 6.])
3 tensor([7., 8.])
w/ accumulation, the final model weight is 2.04000
w/o accumulation, the final model weight is 2.04000
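
As a sanity check on the printed value, the weight after a single full-batch SGD step can be worked out by hand. The sketch below redoes that arithmetic in plain Python, without PyTorch, using the same data, initial weight, and learning rate as the example. The variable names are illustrative.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
lr = 0.02
w = 0.0  # initial weight, as in the example

# d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

# One plain SGD step.
w_after = w - lr * grad

print(f"{grad:.1f} {w_after:.5f}")  # -102.0 2.04000
```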

The code in "Converting it to Accelerate" does not yet perform gradient accumulation efficiently, because of a process called gradient synchronization. Read more about that in the Concepts tutorial!

To let Accelerate handle the gradient accumulation for you, pass a gradient_accumulation_steps parameter to Accelerator. It dictates the number of steps to perform before each call to step() and how to automatically adjust the loss during the call to backward().

Alternatively, you can pass a gradient_accumulation_plugin parameter to the Accelerator object's __init__, which allows you to further customize the gradient accumulation behavior. Read more about that in the GradientAccumulationPlugin docs.

From there, you can use the accumulate() context manager inside your training loop to perform the gradient accumulation automatically. You just wrap it around the entire training part of your code.

With these changes, the Accelerator keeps track of the batch number you are on, and it automatically knows whether to step through the prepared optimizer and how to adjust the loss.

To learn more about what the accumulate() context manager wraps around, read the Gradient Synchronization concept guide.