AWS provides two generations of the Inferentia accelerator built for machine learning inference with higher throughput, lower latency but lower cost: inf2 (NeuronCore-v2) and inf1 (NeuronCore-v1).
In production environments, to deploy 🤗 Transformers models on Neuron devices, you need to compile your models and export them to a serialized format before inference. Through Ahead-Of-Time (AOT) compilation with Neuron Compiler( neuronx-cc or neuron-cc ), your models will be converted to serialized and optimized TorchScript modules.
To understand a little bit more about the compilation, here are general steps executed under the hood:
NEFF: Neuron Executable File Format which is a binary executable on Neuron devices.
Although pre-compilation avoids overhead during the inference, traced Neuron module has some limitations:
Traced Neuron module will be static, which requires fixed input shapes and data types used passed during the compilation. As the model won’t be dynamically recompiled, the inference will fail if any of the above conditions change. (But these limitations could be bypass with dynamic batching and bucketing).
Neuron models are hardware-specialized, which means:
Models traced with Neuron can no longer be executed in non-Neuron environment.
Models compiled for inf1 (NeuronCore-v1) are not compatible with inf2 (NeuronCore-v2), and vice versa.
In this guide, we’ll show you how to export your models to serialized models optimized for Neuron devices.
🤗 Optimum provides support for the Neuron export by leveraging configuration objects. These configuration objects come ready made for a number of model architectures, and are designed to be easily extendable to other architectures.
To export a 🤗 Transformers model to Neuron, you’ll first need to install some extra dependencies:
For Inf2
Copied
pip install optimum[neuronx]
For Inf1
Copied
pip install optimum[neuron]
The Optimum Neuron export can be used through Optimum command-line:
Copied
optimum-cli export neuron --help
usage: optimum-cli export neuron [-h] -m MODEL [--task TASK] [--atol ATOL] [--cache_dir CACHE_DIR]
[--trust-remote-code] [--disable-validation] [--auto_cast {none,matmul,all}]
[--auto_cast_type {bf16,fp16,tf32}] [--dynamic-batch-size]
[--batch_size BATCH_SIZE] [--sequence_length SEQUENCE_LENGTH]
[--num_choices NUM_CHOICES] [--num_channels NUM_CHANNELS] [--width WIDTH]
[--height HEIGHT] [--num_images_per_prompt NUM_IMAGES_PER_PROMPT]
output
optional arguments:
-h, --help show this help message and exit
Required arguments:
-m MODEL, --model MODEL
Model ID on boincai.com or path on disk to load model from.
output Path indicating the directory where to store generated Neuronx compiled TorchScript model.
Optional arguments:
--task TASK The task to export the model for. If not specified, the task will be auto-inferred based on
the model. Available tasks depend on the model, but are among: ['conversational', 'feature-
extraction', 'fill-mask', 'text-generation', 'text2text-generation', 'text-classification',
'token-classification', 'multiple-choice', 'object-detection', 'question-answering',
'image-classification', 'image-segmentation', 'mask-generation', 'masked-im', 'semantic-
segmentation', 'automatic-speech-recognition', 'audio-classification', 'audio-frame-
classification', 'audio-xvector', 'image-to-text', 'stable-diffusion', 'stable-diffusion-
xl', 'zero-shot-image-classification', 'zero-shot-object-detection'].
--atol ATOL If specified, the absolute difference tolerance when validating the model. Otherwise, the
default atol for the model will be used.
--cache_dir CACHE_DIR
Path indicating where to store cache.
--trust-remote-code Allow to use custom code for the modeling hosted in the model repository. This option
should only be set for repositories you trust and in which you have read the code, as it
will execute on your local machine arbitrary code present in the model repository.
--disable-validation Whether to disable the validation of inference on neuron device compared to the outputs of
original PyTorch model on CPU.
--auto_cast {none,matmul,all}
Whether to cast operations from FP32 to lower precision to speed up the inference. Can be
`"none"`, `"matmul"` or `"all"`.
--auto_cast_type {bf16,fp16,tf32}
The data type to cast FP32 operations to when auto-cast mode is enabled. Can be `"bf16"`,
`"fp16"` or `"tf32"`.
--dynamic-batch-size Enable dynamic batch size for neuron compiled model. If this option is enabled, the input
batch size can be a multiple of the batch size during the compilation, but it comes with a
potential tradeoff in terms of latency.
Input shapes:
--batch_size BATCH_SIZE
Batch size that the Neuronx-cc compiler exported model will be able to take as input.
--sequence_length SEQUENCE_LENGTH
Sequence length that the Neuronx-cc compiler exported model will be able to take as input.
--num_choices NUM_CHOICES
Only for the multiple-choice task. Num choices that the Neuronx-cc compiler exported model
will be able to take as input.
--num_channels NUM_CHANNELS
Image tasks only. Number of channels that the Neuronx-cc compiler exported model will be
able to take as input.
--width WIDTH Image tasks only. Width that the Neuronx-cc compiler exported model will be able to take as
input.
--height HEIGHT Image tasks only. Height that the Neuronx-cc compiler exported model will be able to take
as input.
--num_images_per_prompt NUM_IMAGES_PER_PROMPT
Stable diffusion only. Number of images per prompt that the Neuronx-cc compiler exported
model will be able to take as input.
In the last section, you can see some input shape options to pass for exporting static neuron model, meaning that exact shape inputs should be used during the inference as given during compilation. If you are going to use variable-size inputs, you can pad your inputs to the shape used for compilation as a workaround. If you want the batch size to be dynamic, you can pass --dynamic-batch-size to enable dynamic batching, which means that you will be able to use inputs with difference batch size during inference, but it comes with a potential tradeoff in terms of latency.
You should see the following logs which validate the model on Neuron deivces by comparing with PyTorch model on CPU:
Copied
Validating Neuron model...
-[✓] Neuron model output names match reference model (last_hidden_state)
- Validating Neuron Model output "last_hidden_state":
-[✓] (1, 16, 32) matches (1, 16, 32)
-[✓] all values close (atol: 0.0001)
The Neuronx export succeeded and the exported model was saved at: distilbert_base_uncased_squad_neuron/
This exports a neuron-compiled TorchScript module of the checkpoint defined by the --model argument.
As you can see, the task was automatically detected. This was possible because the model was on the Hub. For local models, providing the --task argument is needed or it will default to the model architecture without any task specific head:
Note that providing the --task argument for a model on the Hub will disable the automatic task detection. The resulting model.neuron file, can then be loaded and run on Neuron devices.
Exporting a model to Neuron via NeuronModel
You will also be able to export your models to Neuron format with optimum.neuron.NeuronModelForXXX model classes. Here is an example:
Copied
>>> from optimum.neuron import NeuronModelForSequenceClassification
>>> input_shapes = {"batch_size": 1, "sequence_length": 64} # mandatory shapes
>>> model = NeuronModelForSequenceClassification.from_pretrained(
... "distilbert-base-uncased-finetuned-sst-2-english", export=True, **input_shapes
... )
# Save the model
>>> model.save_pretrained("./distilbert-base-uncased-finetuned-sst-2-english_neuron/")
And the exported model can be used for inference directly with the NeuronModelForXXX class:
Copied
>>> from transformers import AutoTokenizer
>>> from optimum.neuron import NeuronModelForSequenceClassification
>>> tokenizer = AutoTokenizer.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english_neuron/")
>>> model = NeuronModelForSequenceClassification.from_pretrained("./distilbert-base-uncased-finetuned-sst-2-english_neuron/")
>>> inputs = tokenizer("Hamilton is considered to be the best musical of human history.", return_tensors="pt")
>>> logits = model(**inputs).logits
>>> print(model.config.id2label[logits.argmax().item()])
'POSITIVE'
Exporting Stable Diffusion to Neuron
With the Optimum CLI you can compile components in the Stable Diffusion pipeline to gain acceleration on neuron devices during the inference.
So far, we support the export of following components in the pipeline:
CLIP text encoder
U-Net
VAE encoder
VAE decoder
“These blocks are chosen because they represent the bulk of the compute in the pipeline, and performance benchmarking has shown that running them on Neuron yields significant performance benefit.”
Besides, don’t hesitate to tweak the compilation configuration to find the best tradeoff between performance v.s accuracy in your use case. By default, we suggest casting FP32 matrix multiplication operations to BF16 which offers good performance with moderate sacrifice of the accuracy. Check out the guide from AWS Neuron documentation to better understand the options for your compilation.
Exporting a stable diffusion checkpoint can be done using the CLI:
Copied
optimum-cli export neuron --model stabilityai/stable-diffusion-2-1-base \
--task stable-diffusion \
--batch_size 1 \
--height 512 `# height in pixels of generated image, eg. 512, 768` \
--width 512 `# width in pixels of generated image, eg. 512, 768` \
--num_images_per_prompt 4 `# number of images to generate per prompt, defaults to 1` \
--auto_cast matmul `# cast only matrix multiplication operations` \
--auto_cast_type bf16 `# cast operations from FP32 to BF16` \
sd_neuron/
Exporting Stable Diffusion XL to Neuron
Similar to Stable Diffusion, you will be able to use Optimum CLI to compile components in the SDXL pipeline for inference on neuron devices.
We support the export of following components in the pipeline to boost the speed:
Text encoder
Second text encoder
U-Net (a three times larger UNet than the one in Stable Diffusion pipeline)
VAE encoder
VAE decoder
“Stable Diffusion XL works especially well with images between 768 and 1024.”
Exporting a SDXL checkpoint can be done using the CLI:
Copied
optimum-cli export neuron --model stabilityai/stable-diffusion-xl-base-1.0 \
--task stable-diffusion-xl \
--batch_size 1 \
--height 1024 `# height in pixels of generated image, eg. 768, 1024` \
--width 1024 `# width in pixels of generated image, eg. 768, 1024` \
--num_images_per_prompt 4 `# number of images to generate per prompt, defaults to 1` \
--auto_cast matmul `# cast only matrix multiplication operations` \
--auto_cast_type bf16 `# cast operations from FP32 to BF16` \
sd_neuron/
Selecting a task
Specifying a --task should not be necessary in most cases when exporting from a model on the BOINC AI Hub.
However, in case you need to check for a given a model architecture what tasks the Neuron export supports, we got you covered. First, you can check the list of supported tasks here.
For each model architecture, you can find the list of supported tasks via the ~exporters.tasks.TasksManager. For example, for DistilBERT, for the Neuron export, we have:
Copied
>>> from optimum.exporters.tasks import TasksManager
>>> from optimum.exporters.neuron.model_configs import * # Register neuron specific configs to the TasksManager
>>> distilbert_tasks = list(TasksManager.get_supported_tasks_for_model_type("distilbert", "neuron").keys())
>>> print(distilbert_tasks)
['feature-extraction', 'fill-mask', 'multiple-choice', 'question-answering', 'text-classification', 'token-classification']
You can then pass one of these tasks to the --task argument in the optimum-cli export neuron command, as mentioned above.