# Neuron models for inference

## Neuron Model Inference

*The APIs presented in this documentation apply to inference on* [*inf2*](https://aws.amazon.com/ec2/instance-types/inf2/)*,* [*trn1*](https://aws.amazon.com/ec2/instance-types/trn1/) *and* [*inf1*](https://aws.amazon.com/ec2/instance-types/inf1/)*.*

`NeuronModelForXXX` classes load models from the [BOINC AI Hub](https://hf.co/models) and compile them to a serialized format optimized for Neuron devices. You can then load the compiled model and run inference accelerated by AWS Neuron devices.

### Switching from Transformers to Optimum

The `optimum.neuron.NeuronModelForXXX` model classes are API-compatible with BOINC AI Transformers models, so they integrate seamlessly with BOINC AI’s ecosystem. You can just replace your `AutoModelForXXX` class with the corresponding `NeuronModelForXXX` class in `optimum.neuron`.

If you already use Transformers, you will be able to reuse your code just by replacing model classes:

```
from transformers import AutoTokenizer
-from transformers import AutoModelForSequenceClassification
+from optimum.neuron import NeuronModelForSequenceClassification

# PyTorch checkpoint
-model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

+model = NeuronModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english",
+                                                             export=True, **neuron_kwargs)
```

As shown above, when you use `NeuronModelForXXX` for the first time, you will need to set `export=True` to compile your model from PyTorch to a neuron-compatible format.

You will also need to pass Neuron-specific parameters to configure the export. Each model architecture has its own set of parameters, as detailed in the following sections.

Once your model has been exported, you can save it either locally or on the [BOINC AI Model Hub](https://hf.co/models):

```
# Save the neuron model
>>> model.save_pretrained("a_local_path_for_compiled_neuron_model")

# Push the neuron model to the BOINC AI Hub
>>> model.push_to_hub(
...     "a_local_path_for_compiled_neuron_model", repository_id="my-neuron-repo", use_auth_token=True
... )
```

The next time you want to run inference, just load your compiled model, which saves you the compilation time:

```
>>> from optimum.neuron import NeuronModelForSequenceClassification
>>> model = NeuronModelForSequenceClassification.from_pretrained("my-neuron-repo")
```

As you can see, there is no need to pass the Neuron arguments used during the export: they are saved in a `config.json` file and restored automatically by the `NeuronModelForXXX` class.

The first inference run includes a warmup phase, so it takes 3x-4x longer than a regular run.
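When benchmarking, it can help to measure the warmup run separately from the runs that follow. The sketch below is a generic timing helper, not part of `optimum.neuron`; `time_calls` and `steady_state_latency` are hypothetical names, and any zero-argument callable can be timed.

```python
import time
from typing import Any, Callable, List


def time_calls(fn: Callable[[], Any], runs: int = 5) -> List[float]:
    """Call fn `runs` times and return each call's latency in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def steady_state_latency(latencies: List[float], warmup_runs: int = 1) -> float:
    """Average latency after discarding the first warmup_runs measurements."""
    measured = latencies[warmup_runs:]
    if not measured:
        raise ValueError("need more runs than warmup_runs")
    return sum(measured) / len(measured)
```

With a compiled model, the callable could be `lambda: model(**inputs)`; comparing `latencies[0]` with `steady_state_latency(latencies)` shows how much the warmup run costs on your setup.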

### Discriminative NLP models

As explained in the previous section, you will need only a few modifications to your Transformers code to export and run NLP models:

```
from transformers import AutoTokenizer
-from transformers import AutoModelForSequenceClassification
+from optimum.neuron import NeuronModelForSequenceClassification

# PyTorch checkpoint
-model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Compile your model during the first time
+compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
+input_shapes = {"batch_size": 1, "sequence_length": 64}
+model = NeuronModelForSequenceClassification.from_pretrained(
+    "distilbert-base-uncased-finetuned-sst-2-english", export=True, **compiler_args, **input_shapes,
+)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
inputs = tokenizer("Hamilton is considered to be the best musical of human history.", return_tensors="pt")

logits = model(**inputs).logits
print(model.config.id2label[logits.argmax().item()])
# 'POSITIVE'
```

`compiler_args` are optional arguments for the compiler. They usually control how the compiler trades off inference performance (latency and throughput) against accuracy. Here we cast FP32 operations to BF16 using the Neuron matrix-multiplication engine.

`input_shapes` are mandatory static shape information that you need to send to the neuron compiler. Wondering what shapes are mandatory for your model? Check it out with the following code:

```
>>> from transformers import AutoModelForSequenceClassification
>>> from optimum.exporters import TasksManager

>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

# Infer the task name if you don't know
>>> task = TasksManager.infer_task_from_model(model)  # 'text-classification'

>>> neuron_config_constructor = TasksManager.get_exporter_config_constructor(
...     model=model, exporter="neuron", task='text-classification'
... )
>>> print(neuron_config_constructor.func.get_mandatory_axes_for_task(task))
# ('batch_size', 'sequence_length')
```
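Once the mandatory axes are known, a quick check can confirm that an `input_shapes` dictionary covers them before a long compilation starts. This is a small sketch in plain Python; `missing_axes` is a hypothetical helper, and the tuple is the value printed in the example above.

```python
from typing import Dict, Iterable, List


def missing_axes(mandatory_axes: Iterable[str], input_shapes: Dict[str, int]) -> List[str]:
    """Return the mandatory axes that input_shapes does not define."""
    return [axis for axis in mandatory_axes if axis not in input_shapes]


mandatory = ("batch_size", "sequence_length")  # value printed above
print(missing_axes(mandatory, {"batch_size": 1}))
# ['sequence_length']
print(missing_axes(mandatory, {"batch_size": 1, "sequence_length": 64}))
# []
```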

Be careful: the input shapes used for compilation must be at least as large as the inputs that you will feed into the model during inference.

* What if input sizes are smaller than compilation input shapes?

No worries, the `NeuronModelForXXX` class will pad your inputs to an eligible shape. In addition, you can set `dynamic_batch_size=True` in the `from_pretrained` method to enable dynamic batching, which means that your inputs can have a variable batch size.

*(Just keep in mind: dynamicity and padding come with not only flexibility but also a performance drop. Fair enough!)*
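To make the idea of padding to a static shape concrete, here is an illustration in plain Python. It is not the implementation used by `NeuronModelForXXX` (which also has to handle tensors such as the attention mask); `pad_batch` and the `pad_token_id=0` default are assumptions made for the example.

```python
from typing import List


def pad_batch(
    batch: List[List[int]],
    batch_size: int,
    sequence_length: int,
    pad_token_id: int = 0,
) -> List[List[int]]:
    """Pad token-id sequences up to a fixed (batch_size, sequence_length) shape."""
    if len(batch) > batch_size:
        raise ValueError("batch is larger than the compiled batch_size")
    padded = []
    for sequence in batch:
        if len(sequence) > sequence_length:
            raise ValueError("sequence is longer than the compiled sequence_length")
        padded.append(sequence + [pad_token_id] * (sequence_length - len(sequence)))
    # Fill the remaining rows with padding-only sequences.
    padded += [[pad_token_id] * sequence_length for _ in range(batch_size - len(batch))]
    return padded


print(pad_batch([[101, 2023, 102]], batch_size=2, sequence_length=5))
# [[101, 2023, 102, 0, 0], [0, 0, 0, 0, 0]]
```

The extra padding rows and columns are computed like real inputs, which is where the performance cost mentioned above comes from.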

### Generative NLP models

As explained before, you will need only a few modifications to your Transformers code to export and run generative NLP models.

#### Configuring the export of a generative model

As for non-generative models, two sets of parameters can be passed to the `from_pretrained()` method to configure how a transformers checkpoint is exported to a neuron optimized model:

* `compiler_args = { num_cores, auto_cast_type }` are optional arguments for the compiler. They usually control how the compiler trades off inference performance (latency and throughput) against accuracy.
  * `num_cores` is the number of neuron cores used when instantiating the model. Each neuron core has 16 GB of memory, which means that bigger models need to be split across multiple cores. Defaults to 1.
  * `auto_cast_type` specifies the format used to encode the weights. It can be one of `fp32` (`float32`), `fp16` (`float16`) or `bf16` (`bfloat16`). Defaults to `fp32`.
* `input_shapes = { batch_size, sequence_length }` correspond to the static shape of the model input and the KV-cache (attention keys and values for past tokens).
  * `batch_size` is the number of input sequences that the model will accept. Defaults to 1.
  * `sequence_length` is the maximum number of tokens in an input sequence. Defaults to `max_position_embeddings` (`n_positions` for older models).

```
from transformers import AutoTokenizer
-from transformers import AutoModelForCausalLM
+from optimum.neuron import NeuronModelForCausalLM

# Instantiate and convert to Neuron a PyTorch checkpoint
+compiler_args = {"num_cores": 1, "auto_cast_type": 'fp32'}
+input_shapes = {"batch_size": 1, "sequence_length": 512}
-model = AutoModelForCausalLM.from_pretrained("gpt2")
+model = NeuronModelForCausalLM.from_pretrained("gpt2", export=True, **compiler_args, **input_shapes)
```
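As a rough guide for choosing `num_cores`, the size of the weights alone gives a lower bound. The sketch below assumes 16 GB per core as stated above and the usual byte widths of the three formats; `min_cores_for_weights` is a hypothetical helper, and the 7-billion-parameter model is only an example. The KV-cache and activations also need memory, and the values of `num_cores` that a given architecture accepts may be restricted, so treat the result as a minimum rather than a recommendation.

```python
import math

BYTES_PER_PARAMETER = {"fp32": 4, "fp16": 2, "bf16": 2}
CORE_MEMORY_BYTES = 16 * 1024**3  # 16 GB per neuron core, as stated above


def min_cores_for_weights(num_parameters: int, auto_cast_type: str = "fp32") -> int:
    """Lower bound on num_cores from weight size alone (ignores KV-cache and activations)."""
    weight_bytes = num_parameters * BYTES_PER_PARAMETER[auto_cast_type]
    return max(1, math.ceil(weight_bytes / CORE_MEMORY_BYTES))


# Hypothetical 7-billion-parameter model
print(min_cores_for_weights(7_000_000_000, "fp32"))  # 2
print(min_cores_for_weights(7_000_000_000, "bf16"))  # 1
```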

As explained before, these parameters can only be configured during export. This means in particular that during inference:

* the `batch_size` of the inputs should be equal to the `batch_size` used during export,
* the `length` of the input sequences should be lower than the `sequence_length` used during export,
* the maximum number of tokens (input + generated) cannot exceed the `sequence_length` used during export.
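The three constraints above can be checked before calling the model. This is a sketch in plain Python rather than an `optimum.neuron` API; `check_generation_request` is a hypothetical helper that takes the prompt lengths in tokens and the number of tokens you intend to generate.

```python
from typing import List


def check_generation_request(
    prompt_lengths: List[int],
    max_new_tokens: int,
    batch_size: int,
    sequence_length: int,
) -> None:
    """Raise ValueError if a request does not fit the shapes fixed at export."""
    if len(prompt_lengths) != batch_size:
        raise ValueError(
            f"expected {batch_size} input sequences, got {len(prompt_lengths)}"
        )
    longest_prompt = max(prompt_lengths, default=0)
    if longest_prompt >= sequence_length:
        raise ValueError("input sequences must be shorter than sequence_length")
    if longest_prompt + max_new_tokens > sequence_length:
        raise ValueError("input + generated tokens exceed sequence_length")


# Shapes from the export example above: batch_size=1, sequence_length=512
check_generation_request([12], max_new_tokens=256, batch_size=1, sequence_length=512)
```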

#### Text generation inference

As with the original transformers models, use `generate()` instead of `forward()` to generate text sequences.

```
import torch
from transformers import AutoTokenizer
-from transformers import AutoModelForCausalLM
+from optimum.neuron import NeuronModelForCausalLM

# Instantiate and convert to Neuron a PyTorch checkpoint
-model = AutoModelForCausalLM.from_pretrained("gpt2")
+model = NeuronModelForCausalLM.from_pretrained("gpt2", export=True)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token_id = tokenizer.eos_token_id

tokens = tokenizer("I really wish ", return_tensors="pt")
with torch.inference_mode():
    sample_output = model.generate(
        **tokens,
        do_sample=True,
        min_length=128,
        max_length=256,
        temperature=0.7,
    )
    outputs = [tokenizer.decode(tok) for tok in sample_output]
    print(outputs)
```

The generation is highly configurable. Please refer to the [text generation strategies guide](https://huggingface.co/docs/transformers/generation_strategies) for details.

Please be aware that:

* for each model architecture, default values are provided for all parameters, but values passed to the `generate` method will take precedence,
* the generation parameters can be stored in a `generation_config.json` file. When such a file is present in the model directory, it will be parsed to set the default parameters (the values passed to the `generate` method still take precedence).
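The precedence described above can be pictured as a layered dictionary merge. The sketch below only illustrates the ordering and is not how Transformers implements it; `resolve_generation_parameters` is a hypothetical helper and the parameter values are made up for the example.

```python
from typing import Any, Dict, Optional


def resolve_generation_parameters(
    architecture_defaults: Dict[str, Any],
    generation_config_file: Optional[Dict[str, Any]] = None,
    generate_kwargs: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge parameter sources from lowest to highest priority."""
    resolved = dict(architecture_defaults)
    resolved.update(generation_config_file or {})  # generation_config.json, if present
    resolved.update(generate_kwargs or {})  # arguments passed to generate() win
    return resolved


print(
    resolve_generation_parameters(
        {"do_sample": False, "max_length": 20},
        {"max_length": 128, "temperature": 0.9},
        {"temperature": 0.7},
    )
)
# {'do_sample': False, 'max_length': 128, 'temperature': 0.7}
```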

### Stable Diffusion

Optimum extends `Diffusers` to support inference on Neuron. To get started, make sure you have installed Diffusers:

```
pip install "optimum[neuronx, diffusers]"
```

You can also accelerate the inference of stable diffusion on neuronx devices (inf2 / trn1). There are four components which need to be exported to the `.neuron` format to boost the performance:

* Text encoder
* U-Net
* VAE encoder
* VAE decoder

#### Text-to-Image

The `NeuronStableDiffusionPipeline` class allows you to generate images from a text prompt on neuron devices, similar to the experience with `diffusers`.

As with other tasks, you need to compile the models before you can run inference. The export can be done either via the CLI or via the `NeuronStableDiffusionPipeline` API.

To apply optimized compute of the U-Net’s attention score, set the environment variable with `export NEURON_FUSE_SOFTMAX=1`.

Also, don’t hesitate to tweak the compilation configuration to find the best tradeoff between performance and accuracy for your use case. By default, we suggest casting FP32 matrix multiplication operations to BF16, which offers good performance with a moderate loss of accuracy. Check out the guide from [AWS Neuron documentation](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/appnotes/neuronx-cc/neuronx-cc-training-mixed-precision.html#neuronx-cc-training-mixed-precision) to better understand the options for your compilation.

Here is an example of exporting stable diffusion components with `NeuronStableDiffusionPipeline`:

```
>>> from optimum.neuron import NeuronStableDiffusionPipeline

>>> model_id = "runwayml/stable-diffusion-v1-5"
>>> compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
>>> input_shapes = {"batch_size": 1, "height": 512, "width": 512}

>>> stable_diffusion = NeuronStableDiffusionPipeline.from_pretrained(model_id, export=True, **compiler_args, **input_shapes)

# Save locally or upload to the BOINC AI Hub
>>> save_directory = "sd_neuron/"
>>> stable_diffusion.save_pretrained(save_directory)
>>> stable_diffusion.push_to_hub(
...     save_directory, repository_id="my-neuron-repo", use_auth_token=True
... )
```

Now generate an image with a prompt on neuron:

```
>>> prompt = "a photo of an astronaut riding a horse on mars"
>>> image = stable_diffusion(prompt).images[0]
```

![stable diffusion generated image](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/guides/models/01-sd-image.png)

#### Image-to-Image

With the `NeuronStableDiffusionImg2ImgPipeline` class, you can generate a new image conditioned on a text prompt and an initial image.

```
import requests
from PIL import Image
from io import BytesIO
from optimum.neuron import NeuronStableDiffusionImg2ImgPipeline

model_id = "nitrosocke/Ghibli-Diffusion"
input_shapes = {"batch_size": 1, "height": 512, "width": 512}
pipeline = NeuronStableDiffusionImg2ImgPipeline.from_pretrained(model_id, export=True, **input_shapes, device_ids=[0, 1])
pipeline.save_pretrained("sd_img2img/")

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"

response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((512, 512))

prompt = "ghibli style, a fantasy landscape with snowcapped mountains, trees, lake with detailed reflection. sunlight and cloud in the sky, warm colors, 8K"

image = pipeline(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images[0]
image.save("fantasy_landscape.png")
```

| `image` | `prompt` | output |
| :-: | :-: | :-: |
| ![landscape photo](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/neuron/models/03-sd-img2img-init.png) | ***ghibli style, a fantasy landscape with snowcapped mountains, trees, lake with detailed reflection. warm colors, 8K*** | ![drawing](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/neuron/models/04-sd-img2img.png) |

#### Inpaint

With the `NeuronStableDiffusionInpaintPipeline` class, you can edit specific parts of an image by providing a mask and a text prompt.

```
import requests
from PIL import Image
from io import BytesIO
from optimum.neuron import NeuronStableDiffusionInpaintPipeline

model_id = "runwayml/stable-diffusion-inpainting"
input_shapes = {"batch_size": 1, "height": 512, "width": 512}
pipeline = NeuronStableDiffusionInpaintPipeline.from_pretrained(model_id, export=True, **input_shapes, device_ids=[0, 1])
pipeline.save_pretrained("sd_inpaint/")

def download_image(url):
    response = requests.get(url)
    return Image.open(BytesIO(response.content)).convert("RGB")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"

init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image).images[0]
image.save("cat_on_bench.png")
```

|                                                                 `image`                                                                 |                                                                 `mask_image`                                                                 |                               `prompt`                               |                                                                                                                output |
| :-------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------------------------------------------------------------------------------: | :------------------------------------------------------------------: | --------------------------------------------------------------------------------------------------------------------: |
| ![drawing](https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png) | ![drawing](https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png) | ***Face of a yellow cat, high resolution, sitting on a park bench*** | ![drawing](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/neuron/models/05-sd-inpaint.png) |

### Stable Diffusion XL

#### Text-to-Image

As with Stable Diffusion, you can use the `NeuronStableDiffusionXLPipeline` API to export SDXL models and run inference with them on Neuron devices.

```
>>> from optimum.neuron import NeuronStableDiffusionXLPipeline

>>> model_id = "stabilityai/stable-diffusion-xl-base-1.0"
>>> compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16"}
>>> input_shapes = {"batch_size": 1, "height": 1024, "width": 1024}

>>> stable_diffusion_xl = NeuronStableDiffusionXLPipeline.from_pretrained(model_id, export=True, **compiler_args, **input_shapes)

# Save locally or upload to the BOINC AI Hub
>>> save_directory = "sd_neuron_xl/"
>>> stable_diffusion_xl.save_pretrained(save_directory)
>>> stable_diffusion_xl.push_to_hub(
...     save_directory, repository_id="my-neuron-repo", use_auth_token=True
... )
```

Now generate an image with a text prompt on neuron:

```
>>> prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
>>> image = stable_diffusion_xl(prompt).images[0]
```

![sdxl generated image](https://raw.githubusercontent.com/huggingface/optimum-neuron/main/docs/assets/guides/models/02-sdxl-image.jpeg)

#### Image-to-Image

With `NeuronStableDiffusionXLImg2ImgPipeline`, you can pass an initial image, and a text prompt to condition generated images:

```
from optimum.neuron import NeuronStableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

prompt = "a dog running, lake, moat"
url = "https://huggingface.co/datasets/optimum/documentation-images/resolve/main/intel/openvino/sd_xl/castle_friedrich.png"
init_image = load_image(url).convert("RGB")

pipe = NeuronStableDiffusionXLImg2ImgPipeline.from_pretrained("sd_neuron_xl/", device_ids=[0, 1])
image = pipe(prompt=prompt, image=init_image).images[0]
```

| `image` | `prompt` | output |
| :-: | :-: | :-: |
| ![castle photo](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/intel/openvino/sd_xl/castle_friedrich.png) | ***a dog running, lake, moat*** | ![castle with dog](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/neuron/models/06-sdxl-img2img.png) |

#### Inpaint

With `NeuronStableDiffusionXLInpaintPipeline`, pass the original image and a mask of what you want to replace in the original image. Then replace the masked area with content described in a prompt.

```
from optimum.neuron import NeuronStableDiffusionXLInpaintPipeline
from diffusers.utils import load_image

img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png"
mask_url = (
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png"
)

init_image = load_image(img_url).convert("RGB")
mask_image = load_image(mask_url).convert("RGB")
prompt = "A deep sea diver floating"

pipe = NeuronStableDiffusionXLInpaintPipeline.from_pretrained("sd_neuron_xl/", device_ids=[0, 1])
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0]
```

|                                                        `image`                                                        |                                                        `mask_image`                                                       |             `prompt`            |                                                                                                                  output |
| :-------------------------------------------------------------------------------------------------------------------: | :-----------------------------------------------------------------------------------------------------------------------: | :-----------------------------: | ----------------------------------------------------------------------------------------------------------------------: |
| ![drawing](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png) | ![drawing](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png) | ***A deep sea diver floating*** | ![drawing](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/neuron/models/07-sdxl-inpaint.png) |

#### Refine Image Quality

SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) to denoise low-noise stage images generated from the base model. There are two ways to use the refiner:

1. use the base and refiner model together to produce a refined image.
2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image.

**Base + refiner model**

```
from optimum.neuron import NeuronStableDiffusionXLPipeline, NeuronStableDiffusionXLImg2ImgPipeline

prompt = "A majestic lion jumping from a big stone at night"
base = NeuronStableDiffusionXLPipeline.from_pretrained("sd_neuron_xl/", device_ids=[0, 1])
image = base(
    prompt=prompt,
    num_images_per_prompt=1,  # was an undefined variable; 1 is used here as an example value
    num_inference_steps=40,
    denoising_end=0.8,
    output_type="latent",
).images[0]
del base  # To avoid neuron device OOM

refiner = NeuronStableDiffusionXLImg2ImgPipeline.from_pretrained("sd_neuron_xl_refiner/", device_ids=[0, 1])
image = refiner(
    prompt=prompt,
    num_inference_steps=40,
    denoising_start=0.8,
    image=image,
).images[0]
```

![sdxl base + refiner](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/neuron/models/08-sdxl-base-refine.png)
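In the example above, `denoising_end=0.8` on the base and `denoising_start=0.8` on the refiner split the 40 inference steps between the two models. The arithmetic below is only an approximation of that split: the pipelines derive the actual cutoff from the scheduler's timesteps, so the exact counts may differ slightly. `split_inference_steps` is a hypothetical helper.

```python
from typing import Tuple


def split_inference_steps(num_inference_steps: int, denoising_fraction: float) -> Tuple[int, int]:
    """Approximate how many steps the base and the refiner each run."""
    if not 0.0 <= denoising_fraction <= 1.0:
        raise ValueError("denoising_fraction must be between 0 and 1")
    base_steps = round(num_inference_steps * denoising_fraction)
    return base_steps, num_inference_steps - base_steps


print(split_inference_steps(40, 0.8))
# (32, 8)
```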

**Base to refiner model**

```
from optimum.neuron import NeuronStableDiffusionXLPipeline, NeuronStableDiffusionXLImg2ImgPipeline

prompt = "A majestic lion jumping from a big stone at night"
base = NeuronStableDiffusionXLPipeline.from_pretrained("sd_neuron_xl/", device_ids=[0, 1])
image = base(prompt=prompt, output_type="latent").images[0]
del base  # To avoid neuron device OOM

refiner = NeuronStableDiffusionXLImg2ImgPipeline.from_pretrained("sd_neuron_xl_refiner/", device_ids=[0, 1])
image = refiner(prompt=prompt, image=image[None, :]).images[0]
```

| Base Image | Refined Image |
| :-----------------------------------------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------------------------------------------------: |
| ![drawing](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/neuron/models/09-sdxl-base-full.png) | ![drawing](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/neuron/models/010-sdxl-refiner-detailed.png) |

To avoid running out of memory on the Neuron device, finish all base inference and release the device memory before running the refiner.
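One way to make the "run, then release" pattern harder to forget is to wrap it in a small helper. This is a sketch under the assumption that device memory is freed once the pipeline object is garbage collected, which is what the `del base` lines in the examples above rely on; `run_then_release` is a hypothetical name, and the caller must not keep another reference to the pipeline.

```python
import gc
from typing import Any, Callable


def run_then_release(load_pipeline: Callable[[], Any], run: Callable[[Any], Any]) -> Any:
    """Load a pipeline, run it once, and drop the pipeline before returning the result."""
    pipeline = load_pipeline()
    try:
        return run(pipeline)
    finally:
        # Drop the only local reference, then ask Python to collect garbage.
        del pipeline
        gc.collect()
```

For the base stage, `load_pipeline` could be a lambda that calls `NeuronStableDiffusionXLPipeline.from_pretrained("sd_neuron_xl/", device_ids=[0, 1])`, and `run` a lambda that returns the latent image; the refiner is then loaded only after the helper returns.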

Happy inference with Neuron! 🚀

