Comparing performance between different device setups

Evaluating and comparing the performance of different setups can be quite tricky if you don't know what to look for. For example, you cannot run the same script with the same batch size on a TPU, multi-GPU, and single-GPU setup with Accelerate and expect your results to line up.

But why?

There are three reasons for this, which this tutorial will cover:

  1. Setting the right seeds

  2. Observed Batch Sizes

  3. Learning Rates

Setting the Seed

While this issue has not come up as much, make sure to use utils.set_seed() to fully set the seed in all distributed cases so training will be reproducible:


from accelerate.utils import set_seed

set_seed(42)

Why is this important? Under the hood this will set 5 different seed settings:


random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
# ^^ safe to call this function even if cuda is not available
if is_tpu_available():
    xm.set_rng_state(seed)

These cover Python's random state, numpy's state, torch's state, torch's CUDA state, and, if TPUs are available, torch_xla's random state.

Observed Batch Sizes

When training with Accelerate, the batch size passed to the dataloader is the batch size per GPU. This means that a batch size of 64 on two GPUs is really a global batch size of 128. As a result, this needs to be accounted for when testing on a single GPU, and similarly for TPUs.

The below table can be used as a quick reference to try out different batch sizes:

In this example, there are two GPUs for "Multi-GPU" and a TPU pod with 8 workers.

Single GPU Batch Size | Multi-GPU Equivalent Batch Size | TPU Equivalent Batch Size
----------------------|---------------------------------|--------------------------
256                   | 128                             | 32
128                   | 64                              | 16
64                    | 32                              | 8
32                    | 16                              | 4
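For instance, here is a minimal sketch of how the per-process batch size passed to the dataloader relates to the effective global batch size; the toy TensorDataset and the per_device_batch_size variable are illustrative only and not part of the Accelerate docs:

from accelerate import Accelerator
import torch
from torch.utils.data import DataLoader, TensorDataset

accelerator = Accelerator()

# The batch size given to the dataloader is *per process*
per_device_batch_size = 64
dataset = TensorDataset(torch.randn(1024, 10))
dataloader = DataLoader(dataset, batch_size=per_device_batch_size, shuffle=True)
dataloader = accelerator.prepare(dataloader)

# The effective (global) batch size seen per training step across all processes
effective_batch_size = per_device_batch_size * accelerator.num_processes
accelerator.print(f"Effective batch size: {effective_batch_size}")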

Learning Rates

As noted in multiple sources, the learning rate should be scaled linearly based on the number of devices present. The below snippet shows doing so with Accelerate.

Since users can have their own learning rate schedulers defined, we leave it up to the user to decide if they wish to scale their learning rate or not.


from accelerate import Accelerator
from torch.optim import AdamW

accelerator = Accelerator()

# Scale the base learning rate by the number of processes (devices)
learning_rate = 1e-3
learning_rate *= accelerator.num_processes

optimizer = AdamW(params=model.parameters(), lr=learning_rate)

You will also find that Accelerate steps the learning rate scheduler based on the number of processes being trained on. This is because of the observed batch size noted earlier: with 2 GPUs, the learning rate is stepped twice as often as on a single GPU, to account for the batch size being twice as large (assuming no changes are made to the batch size on the single-GPU instance).
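As a minimal sketch of the behavior described above (the linear model, AdamW optimizer, and StepLR schedule are placeholders, and the default Accelerator configuration is assumed), passing the scheduler to accelerator.prepare() is what enables this stepping:

from accelerate import Accelerator
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

accelerator = Accelerator()
model = torch.nn.Linear(10, 2)
optimizer = AdamW(model.parameters(), lr=1e-3)
scheduler = StepLR(optimizer, step_size=1000)

# After prepare(), a single scheduler.step() call advances the underlying
# schedule once per process, matching the larger observed batch size
model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)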

Gradient Accumulation and Mixed Precision

When using gradient accumulation and mixed precision, some degradation in performance is expected due to how gradient averaging works (accumulation) and the precision loss (mixed precision). This will be most noticeable when comparing the batch-wise loss between different compute setups. However, the overall loss, metric, and general performance at the end of training should be roughly the same.
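For reference, here is a minimal sketch of enabling both features through the Accelerator; the toy model, random data, and loss are illustrative only, and fp16 assumes a CUDA GPU is available:

from accelerate import Accelerator
import torch
from torch.utils.data import DataLoader, TensorDataset

# Enable both mixed precision and gradient accumulation
accelerator = Accelerator(mixed_precision="fp16", gradient_accumulation_steps=2)

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = DataLoader(TensorDataset(torch.randn(64, 10), torch.randn(64, 2)), batch_size=8)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    # Gradients are synchronized and applied only every
    # `gradient_accumulation_steps` batches inside this context manager
    with accelerator.accumulate(model):
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()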
