How to perform distributed inference with normal resources
Distributed Inference with 🤗 Accelerate
Distributed inference is a common use case, especially with natural language processing (NLP) models. Users often want to send a number of different prompts, each to a different GPU, and then get the results back. There are other use cases outside of NLP, but for this tutorial we will focus on this idea of each GPU receiving a different prompt and then returning the results.
The Problem
Normally when doing this, users send the model to a specific device to load it from the CPU, and then move each prompt to a different device.
A basic pipeline using the diffusers library might look something like so:
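As a rough sketch, assuming a Stable Diffusion checkpoint such as runwayml/stable-diffusion-v1-5 (any diffusers-compatible checkpoint would work here):

```python
import torch
from diffusers import DiffusionPipeline

# Assumed checkpoint; substitute any diffusers-compatible model
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
```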
This is then followed by performing inference based on the specific prompt:
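Continuing that sketch, a hypothetical `run_inference` function (the function name and output filenames are illustrative) might pick a prompt per rank like this:

```python
import torch.distributed as dist

def run_inference(rank, world_size):
    # Initialize the process group and move the pipeline from above to this rank's GPU
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    pipe.to(f"cuda:{rank}")

    # Manually select a prompt based on which process we are
    if dist.get_rank() == 0:
        prompt = "a dog"
    elif dist.get_rank() == 1:
        prompt = "a cat"

    result = pipe(prompt).images[0]
    result.save(f"result_{rank}.png")
```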
One will notice how we have to check the rank to know what prompt to send, which can be a bit tedious.
A user might then also think that with 🤗 Accelerate, using the Accelerator to prepare a dataloader for such a task might also be a simple way to manage this. (To learn more, check out the relevant section in the Quick Tour.)
Can it manage it? Yes. Does it add unneeded extra code, however? Also yes.
The Solution
With 🤗 Accelerate, we can simplify this process by using the Accelerator.split_between_processes() context manager (which also exists in PartialState and AcceleratorState). This function will automatically split whatever data you pass to it (be it a prompt, a set of tensors, a dictionary of the prior data, etc.) across all the processes (with a potential to be padded) for you to use right away.
Let's rewrite the above example using this context manager:
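A sketch of the rewritten example, using the same assumed checkpoint as above:

```python
import torch
from accelerate import PartialState  # Accelerator or AcceleratorState also work here
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
distributed_state = PartialState()
pipe.to(distributed_state.device)

# Each process automatically receives its own slice of the prompts
with distributed_state.split_between_processes(["a dog", "a cat"]) as prompt:
    result = pipe(prompt).images[0]
    result.save(f"result_{distributed_state.process_index}.png")
```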
And then to launch the code, we can use 🤗 Accelerate:
If you have generated a config file to be used using accelerate config:
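Assuming the script above is saved as distributed_inference.py (an illustrative name):

```bash
accelerate launch distributed_inference.py
```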
If you have a specific config file you want to use:
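Here my_config_file.yaml stands in for your own config file path:

```bash
accelerate launch --config_file my_config_file.yaml distributed_inference.py
```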
Or if you don't want to make any config files and launch on two GPUs:
Note: You will get some warnings about values being guessed based on your system. To remove these you can run accelerate config default or go through accelerate config to create a config file.
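Again assuming the illustrative script name from above:

```bash
accelerate launch --num_processes 2 distributed_inference.py
```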
We've now reduced the boilerplate code needed to split this data to a few lines of code quite easily.
But what if we have an odd distribution of prompts to GPUs? For example, what if we have 3 prompts, but only 2 GPUs?
Under the context manager, the first GPU would receive the first two prompts and the second GPU the third, ensuring that all prompts are split and no overhead is needed.
However, what if we then wanted to do something with the results of all the GPUs? (Say gather them all and perform some kind of post processing.) You can pass in apply_padding=True to ensure that the lists of prompts are padded to the same length, with extra data being taken from the last sample. This way all GPUs will have the same number of prompts, and you can then gather the results.
This is only needed when trying to perform an action such as gathering the results, where the data on each device needs to be the same length. Basic inference does not require this.
For instance:
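A sketch reusing the setup from above, with three prompts split across two GPUs:

```python
import torch
from accelerate import PartialState
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # assumed checkpoint
)
distributed_state = PartialState()
pipe.to(distributed_state.device)

# apply_padding=True repeats the last prompt so every process receives the same
# number of prompts, which keeps gathered results the same length on each device
with distributed_state.split_between_processes(
    ["a dog", "a cat", "a chicken"], apply_padding=True
) as prompt:
    result = pipe(prompt).images
```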
On the first GPU, the prompts will be ["a dog", "a cat"], and on the second GPU it will be ["a chicken", "a chicken"]. Make sure to drop the final sample, as it will be a duplicate of the previous one.