Debugging timeout errors

Debugging Distributed Operations

When running scripts in a distributed fashion, functions such as Accelerator.gather() and Accelerator.reduce() (and others) are often necessary to grab tensors across devices and perform certain operations on them. However, if the tensors being grabbed are not the proper shapes, your code will hang forever. The only sign that this is truly happening is hitting a timeout exception from torch.distributed, and that can get quite costly because the timeout is usually 10 minutes.

Accelerate now has a debug mode which adds a negligible amount of time to each operation, but which verifies that the inputs you pass in can actually perform the operation you want without hitting this timeout problem!

Visualizing the problem

To have a tangible example of this issue, let’s take the following setup (on 2 GPUs):

import torch

from accelerate import PartialState
from accelerate.utils import broadcast

state = PartialState()
# Process 0 builds a tensor of shape [1, 5]; every other process builds one
# of shape [1, 2, 5], so the shapes do not line up across devices.
if state.process_index == 0:
    tensor = torch.tensor([[0.0, 1, 2, 3, 4]]).to(state.device)
else:
    tensor = torch.tensor([[[0.0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]]).to(state.device)

broadcast_tensor = broadcast(tensor)
print(broadcast_tensor)

We’ve created a single tensor on each device, with two radically different shapes. With this setup, if we want to perform an operation such as utils.broadcast(), we would forever hit a timeout because torch.distributed requires that these operations have the exact same shape across all processes for them to work.

If you run this yourself, you will find that broadcast_tensor can be printed on the main process, but its result won’t quite be right, and then the script will simply hang, never printing anything on any of the other processes:

>>> tensor([[0, 1, 2, 3, 4]], device='cuda:0')

The solution

By enabling Accelerate’s operational debug mode, Accelerate will properly find and catch errors such as this and provide a very clear traceback immediately:

Traceback (most recent call last):
  File "/home/zach_mueller_huggingface_co/test.py", line 18, in <module>
    main()
  File "/home/zach_mueller_huggingface_co/test.py", line 15, in main
        main()broadcast_tensor = broadcast(tensor)
  File "/home/zach_mueller_huggingface_co/accelerate/src/accelerate/utils/operations.py", line 303, in wrapper
    broadcast_tensor = broadcast(tensor)
accelerate.utils.operations.DistributedOperationException: Cannot apply desired operation due to shape mismatches. All shapes across devices must be valid.

Operation: `accelerate.utils.operations.broadcast`
Input shapes:
  - Process 0: [1, 5]
  - Process 1: [1, 2, 5]

This explains that the shapes across our devices were not the same, and that we should ensure that they match properly to be compatible. Typically this means that there is either an extra dimension, or certain dimensions are incompatible with the operation.
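
In this case the fix is simply to make sure every process builds a tensor of the same shape before calling the collective operation. Below is a minimal sketch of a corrected version of the earlier example (the values are arbitrary and only illustrate matching shapes):

import torch

from accelerate import PartialState
from accelerate.utils import broadcast

state = PartialState()
# Every process now builds a tensor of the same shape ([1, 5]); only the
# values differ, so torch.distributed can perform the broadcast.
if state.process_index == 0:
    tensor = torch.tensor([[0.0, 1, 2, 3, 4]]).to(state.device)
else:
    tensor = torch.tensor([[5.0, 6, 7, 8, 9]]).to(state.device)

broadcast_tensor = broadcast(tensor)
# Each process prints the values broadcast from process 0.
print(broadcast_tensor)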

To enable this, please do one of the following:

Enable it through the questionnaire during accelerate config (recommended)

From the CLI:

accelerate launch --debug {my_script.py} --arg1 --arg2

As an environment variable (which avoids the need for accelerate config):

ACCELERATE_DEBUG_MODE="1" accelerate launch {my_script.py} --arg1 --arg2

Manually changing the config.yaml file:

 compute_environment: LOCAL_MACHINE
+debug: true
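
For reference, an illustrative config.yaml for a single machine with two GPUs and debug mode enabled might look roughly like the following (the exact fields and values depend on your answers during accelerate config):

compute_environment: LOCAL_MACHINE
debug: true
distributed_type: MULTI_GPU
machine_rank: 0
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 2
use_cpu: false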