🌍 Accelerate’s internal mechanisms

Internally, 🌍 Accelerate works by first analyzing the environment in which the script is launched to determine which kind of distributed setup is used, how many different processes there are and which one the current script is in. All that information is stored in the AcceleratorState.

This class is initialized the first time you instantiate an Accelerator, performing along the way any specific initialization your distributed setup needs. Its state is then uniquely shared across all instances of AcceleratorState. (The same can also be done with the PartialState, a more barebones version from which it inherits.)
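
As a quick illustration, here is a minimal sketch of inspecting that shared state; the attribute names (distributed_type, num_processes, process_index, device) come from the public AcceleratorState API, and the printed values depend on how the script is launched (for example via accelerate launch):

```python
from accelerate import Accelerator

accelerator = Accelerator()
state = accelerator.state  # the shared AcceleratorState instance

# Filled in from the launch environment; run with `accelerate launch` to see
# multi-process values instead of the single-process defaults.
print(f"distributed type: {state.distributed_type}")
print(f"num processes:    {state.num_processes}")
print(f"process index:    {state.process_index}")
print(f"device:           {state.device}")
```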

Then, when calling prepare(), the library does the following (see the sketch after this list):

  • wraps your model(s) in the container adapted for the distributed setup,

  • wraps your optimizer(s) in an AcceleratedOptimizer,

  • wraps your scheduler(s) in an AcceleratedScheduler,

  • creates a new version of your dataloader(s) in a DataLoaderShard or DataLoaderDispatcher.
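
A minimal sketch of that wrapping is shown below; the model, optimizer, scheduler, and dataset are placeholders, and only the prepare() call is the point:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1)
dataloader = DataLoader(TensorDataset(torch.randn(64, 4)), batch_size=8)

# Each object comes back wrapped: the model in the container for the current
# distributed setup, the optimizer and scheduler in their Accelerated* wrappers,
# and the dataloader re-created as a DataLoaderShard (or DataLoaderDispatcher).
model, optimizer, scheduler, dataloader = accelerator.prepare(
    model, optimizer, scheduler, dataloader
)
```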

While the model(s), optimizer(s), and scheduler(s) are just put in simple wrappers, the dataloader(s) are re-created. This is mostly because PyTorch does not let the user change the batch_sampler of a dataloader once it has been created, and the library handles the sharding of your data between processes by changing that batch_sampler so that each process yields one batch out of every num_processes batches (if enabled).
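
To make that sharding concrete, here is a plain-Python illustration (not Accelerate code) of how batch indices are divided when two processes share one batch_sampler:

```python
# Illustration only: "one batch out of every num_processes" means each process keeps
# an interleaved slice of the batches produced by the original batch_sampler.
batches = list(range(8))  # indices of the batches the original batch_sampler yields
num_processes = 2

for process_index in range(num_processes):
    shard = batches[process_index::num_processes]
    print(f"process {process_index} sees batches {shard}")

# process 0 sees batches [0, 2, 4, 6]
# process 1 sees batches [1, 3, 5, 7]
```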

The DataLoaderShard subclasses DataLoader to add the following functionality:

  • it synchronizes the appropriate random number generator of all processes at each new iteration, to ensure any randomization (like shuffling) is done the exact same way across processes.

  • it puts the batches on the proper device before yielding them (unless you have opted out of device_placement=True).
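
For example, once a dataloader has been prepared, the batches it yields already live on accelerator.device; this is a minimal sketch assuming the default device_placement=True:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # device_placement=True by default
dataset = TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,)))
dataloader = accelerator.prepare(DataLoader(dataset, batch_size=8, shuffle=True))

for inputs, targets in dataloader:
    # No manual .to(device) needed: the batch was already moved before being yielded.
    assert inputs.device == accelerator.device
    break
```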

The DataLoaderDispatcher subclass differs from the DataLoaderShard in that, when iterating through the DataLoader, the data is all read on process 0 and then split and sent to each process, rather than the split happening at the dataset level.
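
As a hedged sketch (the dispatch_batches argument is how the Accelerator exposed this choice at the time of writing; check your installed version), you can opt into the dispatching behaviour explicitly:

```python
from accelerate import Accelerator

# dispatch_batches=True: process 0 iterates the dataloader and scatters slices to
# the other processes (DataLoaderDispatcher); with the default sharded behaviour,
# every process iterates its own shard of the data (DataLoaderShard).
accelerator = Accelerator(dispatch_batches=True)
```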

The random number generator synchronization will by default synchronize:

  • the generator attribute of a given sampler (like the PyTorch RandomSampler) for PyTorch >= 1.6

  • the main random number generator in PyTorch <=1.5.1

Synchronizing the main torch (or CUDA or XLA) random number generator also affects any other randomness in your dataset (like random data augmentation): all processes will draw the same random numbers from the torch random modules, and so will apply the same random data augmentation if it is controlled by torch.

The randomization part of your custom sampler, batch sampler, or iterable dataset should be done using a local torch.Generator object (in PyTorch >= 1.6); see the traditional RandomSampler as an example.

You can choose which random number generator(s) to synchronize with the rng_types argument of the main Accelerator. In PyTorch >= 1.6, it is recommended to rely on a local generator to avoid setting the same seed in the main random number generator in all processes.
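
For instance, here is a minimal sketch of the recommended pattern, passing rng_types explicitly only to show the knob (synchronizing the local generator is already the default on recent PyTorch versions):

```python
import torch
from torch.utils.data import DataLoader, RandomSampler, TensorDataset
from accelerate import Accelerator

# Synchronize only the sampler's local generator, not the global torch RNG.
accelerator = Accelerator(rng_types=["generator"])

dataset = TensorDataset(torch.arange(100, dtype=torch.float32))
generator = torch.Generator()
generator.manual_seed(42)  # a local RNG, independent of torch.manual_seed
sampler = RandomSampler(dataset, generator=generator)
dataloader = accelerator.prepare(DataLoader(dataset, sampler=sampler, batch_size=10))
```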

For more details about the internals, see the Internals page.
