How to perform inference on large models with small resources

Handling big models for inference

One of the biggest advancements 🌍 Accelerate provides is the concept of large model inference, wherein you can perform inference on models that cannot fully fit on your graphics card.

This tutorial will be broken down into two parts showcasing how to use both 🌍 Accelerate and 🌍 Transformers (a higher-level API) to make use of this idea.

Using 🌍 Accelerate

For these tutorials, we’ll assume a typical workflow for loading your model, such as:

import torch

my_model = ModelClass(...)
state_dict = torch.load(checkpoint_file)
my_model.load_state_dict(state_dict)

Note that here we assume that ModelClass is a model that takes up more video-card memory than what can fit on your device (be it mps or cuda).

The first step is to initialize an empty skeleton of the model, which won’t take up any RAM, using the init_empty_weights() context manager:

from accelerate import init_empty_weights
with init_empty_weights():
    my_model = ModelClass(...)

With this, my_model is currently “parameterless”, hence leaving a smaller footprint than what one would normally get by loading it directly onto the CPU.
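
If you want to verify this, the skeleton’s parameters live on PyTorch’s meta device and hold no data; a minimal sketch (output shown as a comment):

# The skeleton's parameters are created on the "meta" device, so no memory is allocated.
print(next(my_model.parameters()).device)  # prints "meta"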

Next we need to load in the weights to our model so we can perform inference.

For this we will use load_checkpoint_and_dispatch(), which, as the name implies, will load a checkpoint inside your empty model and dispatch the weights for each layer across all the devices you have available (GPU/MPS and CPU RAM).

To determine how this dispatch can be performed, generally specifying device_map="auto" will be good enough, as 🌍 Accelerate will attempt to fill all the space in your GPU(s) first, then load what doesn’t fit into CPU RAM, and finally, if there is still not enough room, offload the rest to disk (the absolute slowest option). For more details on designing your own device map, see this section of the concept guide.

See an example below:

from accelerate import load_checkpoint_and_dispatch

model = load_checkpoint_and_dispatch(
    model, checkpoint=checkpoint_file, device_map="auto"
)
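
If device_map="auto" doesn’t place things the way you want, you can also design your own device map and pass it as a plain dictionary mapping module names to devices. The sketch below is illustrative only: the module names ("block1", "block2", "lm_head") are hypothetical and must match the submodules of your own model, and the offload folder is only needed because one entry is sent to disk.

from accelerate import load_checkpoint_and_dispatch

# Hypothetical layout: first block on GPU 0, second block in CPU RAM,
# and the output head offloaded to disk.
my_device_map = {
    "block1": 0,
    "block2": "cpu",
    "lm_head": "disk",
}

model = load_checkpoint_and_dispatch(
    model,
    checkpoint=checkpoint_file,
    device_map=my_device_map,
    offload_folder="offload",  # where disk-offloaded weights are written
)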

If there are certain “chunks” of layers that shouldn’t be split, you can pass them in as no_split_module_classes. Read more about it here.

Also, to save on memory (such as if the state_dict will not fit in RAM), a model’s weights can be divided and split into multiple checkpoint files. Read more about it here.

Now that the model is dispatched fully, you can perform inference as normal with the model:

input = torch.randn(2,3)
input = input.to("cuda")
output = model(input)

What will happen now is that each time the input gets passed through a layer, that layer’s weights will be sent from the CPU to the GPU (or from disk to CPU to GPU), the output is calculated, and then the weights are pulled back off the GPU, and so on down the line. While this adds some overhead to the inference being performed, through this method it is possible to run any size model on your system, as long as the largest layer is capable of fitting on your GPU.

Multiple GPUs can be utilized, however this is considered “model parallelism”: as a result only one GPU will be active at a given moment, each waiting for the previous one to send it the output. You should launch your script normally with python and do not need torchrun, accelerate launch, etc.

For a visual representation of this, check out the animation below:

Complete Example

Below is the full example showcasing what we performed above:

import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch

with init_empty_weights():
    model = MyModel(...)

model = load_checkpoint_and_dispatch(
    model, checkpoint=checkpoint_file, device_map="auto"
)

input = torch.randn(2,3)
input = input.to("cuda")
output = model(input)
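
To double-check how the weights were actually spread out, the device map resolved during dispatch is stored on the model (as the hf_device_map attribute in recent Accelerate releases; treat the exact attribute name as version-dependent):

# Shows which device (GPU index, "cpu", or "disk") each module was assigned to.
print(model.hf_device_map)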

Using 🌍 Transformers, 🌍 Diffusers, and other 🌍 Open Source Libraries

Libraries that support 🌍 Accelerate big model inference include all of the earlier logic in their from_pretrained constructors.

These operate by specifying a string representing the model to download from the 🌍 Hub and then denoting device_map="auto" along with a few extra parameters.

As a brief example, we will look at using transformers and loading in Big Science’s T0pp model.

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp", device_map="auto")

After loading the model in this way, the initial preparation steps from before have all been done and the model is fully ready to make use of all the resources in your machine. Through these constructors, you can also save more memory by specifying the precision the model is loaded in via the torch_dtype parameter, such as:

import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("bigscience/T0pp", device_map="auto", torch_dtype=torch.float16)
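
To see how much this helps, transformers models expose a get_memory_footprint() method; a quick check (the exact figure depends on the model and dtype) might look like:

# Reports the memory taken by the model's parameters and buffers, in bytes.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")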

To learn more about this, check out the 🌍 Transformers documentation available here.

Where to go from here

For a much more detailed look at big model inference, be sure to check out the Conceptual Guide on it.
