# How to apply graph optimization

## Optimization

🌍 Optimum provides an `optimum.onnxruntime` package that enables you to apply graph optimization on many models hosted on the 🌍 Hub using the [ONNX Runtime](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers) model optimization tool.

### Optimizing a model during the ONNX export

The ONNX model can be optimized directly during the ONNX export with the Optimum CLI by passing the argument `--optimize {O1,O2,O3,O4}`, for example:

```
optimum-cli export onnx --model gpt2 --optimize O3 gpt2_onnx/
```

The optimization levels are:

* O1: basic general optimizations.
* O2: basic and extended general optimizations, transformers-specific fusions.
* O3: same as O2 with GELU approximation.
* O4: same as O3 with mixed precision (fp16, GPU-only, requires `--device cuda`).
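When the export is driven from a script, the command shown above can be assembled programmatically. The following is a minimal sketch and not part of Optimum: `build_export_command` is a hypothetical helper that only builds the argument list from the flags documented here, adding `--device cuda` for O4 as the list above requires.

```python
import shlex

VALID_LEVELS = ("O1", "O2", "O3", "O4")


def build_export_command(model, output_dir, level):
    """Return the optimum-cli argument list for an optimized ONNX export.

    Hypothetical helper for illustration: it only assembles the flags
    documented above and does not run anything.
    """
    if level not in VALID_LEVELS:
        raise ValueError(f"level must be one of {VALID_LEVELS}, got {level!r}")
    command = ["optimum-cli", "export", "onnx", "--model", model, "--optimize", level]
    if level == "O4":
        # O4 uses fp16 mixed precision, which is GPU-only.
        command += ["--device", "cuda"]
    command.append(output_dir)
    return command


command = build_export_command("gpt2", "gpt2_onnx/", "O3")
print(shlex.join(command))
```

Once `optimum-cli` is installed, the resulting list can be handed to `subprocess.run(command, check=True)`.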

### Optimizing a model programmatically with ORTOptimizer

ONNX models can be optimized with the [ORTOptimizer](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/optimization#optimum.onnxruntime.ORTOptimizer). The class can be initialized using the [from\_pretrained()](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/optimization#optimum.onnxruntime.ORTOptimizer.from_pretrained) method, which supports different checkpoint formats.

1. Using an already initialized [ORTModel](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort#optimum.onnxruntime.ORTModel) instance.

```
>>> from optimum.onnxruntime import ORTOptimizer, ORTModelForSequenceClassification

>>> # Load an ONNX model from the Hub
>>> model = ORTModelForSequenceClassification.from_pretrained(
...     "optimum/distilbert-base-uncased-finetuned-sst-2-english"
... )

>>> # Create an optimizer from an ORTModelForXXX
>>> optimizer = ORTOptimizer.from_pretrained(model)
```

2. Using a local ONNX model from a directory.

```
>>> from optimum.onnxruntime import ORTOptimizer

>>> # This assumes a model.onnx exists in path/to/model
>>> optimizer = ORTOptimizer.from_pretrained("path/to/model")
```
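Because this form expects a `model.onnx` file inside the directory, it can be convenient to check for the file before creating the optimizer. The following is a minimal sketch using only the standard library; `has_onnx_model` is a hypothetical helper, and the file name is the one assumed in the comment above.

```python
from pathlib import Path


def has_onnx_model(model_dir, file_name="model.onnx"):
    """Return True if model_dir contains the expected ONNX file.

    Hypothetical helper; "model.onnx" is the file name the example above assumes.
    """
    return (Path(model_dir) / file_name).is_file()


if not has_onnx_model("path/to/model"):
    print("No model.onnx found in path/to/model; export the model to ONNX first.")
```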

#### Optimization Configuration

The [OptimizationConfig](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.OptimizationConfig) class lets you specify how the optimization should be performed by the [ORTOptimizer](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/optimization#optimum.onnxruntime.ORTOptimizer).

In the optimization configuration, there are 4 possible optimization levels:

* `optimization_level=0`: to disable all optimizations
* `optimization_level=1`: to enable basic optimizations such as constant folding or redundant node elimination
* `optimization_level=2`: to enable extended graph optimizations such as node fusions
* `optimization_level=99`: to enable data layout optimizations

Choosing a level enables the optimizations of that level, as well as the optimizations of all preceding levels. More information [here](https://onnxruntime.ai/docs/performance/graph-optimizations.html).
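The cumulative behaviour of the levels can be illustrated with a small pure-Python sketch. This is not Optimum or ONNX Runtime code: `LEVELS` and `enabled_optimizations` are hypothetical names, and the descriptions are taken from the list above.

```python
# Level number -> what that level adds, as described in the list above.
LEVELS = {
    1: "basic optimizations (constant folding, redundant node elimination)",
    2: "extended graph optimizations (node fusions)",
    99: "data layout optimizations",
}


def enabled_optimizations(optimization_level):
    """Return everything enabled at a level, including all preceding levels."""
    if optimization_level != 0 and optimization_level not in LEVELS:
        raise ValueError(f"unsupported optimization_level: {optimization_level}")
    return [desc for level, desc in sorted(LEVELS.items()) if level <= optimization_level]


for level in (0, 1, 2, 99):
    print(level, enabled_optimizations(level))
```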

`enable_transformers_specific_optimizations=True` means that `transformers`-specific graph fusion and approximation are performed in addition to the ONNX Runtime optimizations described above. Here is a list of the possible optimizations you can enable:

* Gelu fusion with `disable_gelu_fusion=False`,
* Layer Normalization fusion with `disable_layer_norm_fusion=False`,
* Attention fusion with `disable_attention_fusion=False`,
* SkipLayerNormalization fusion with `disable_skip_layer_norm_fusion=False`,
* Add Bias and SkipLayerNormalization fusion with `disable_bias_skip_layer_norm_fusion=False`,
* Add Bias and Gelu / FastGelu fusion with `disable_bias_gelu_fusion=False`,
* Gelu approximation with `enable_gelu_approximation=True`.
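Since most of these flags are phrased as `disable_*`, it is easy to flip one the wrong way. The following is a minimal sketch that builds the keyword arguments from a list of fusions to skip; `FUSION_FLAGS`, its short labels, and `fusion_kwargs` are hypothetical names for this illustration, while the flag names themselves are the ones listed above.

```python
# Short label (hypothetical, for this sketch only) -> flag name from the list above.
FUSION_FLAGS = {
    "gelu": "disable_gelu_fusion",
    "layer_norm": "disable_layer_norm_fusion",
    "attention": "disable_attention_fusion",
    "skip_layer_norm": "disable_skip_layer_norm_fusion",
    "bias_skip_layer_norm": "disable_bias_skip_layer_norm_fusion",
    "bias_gelu": "disable_bias_gelu_fusion",
}


def fusion_kwargs(skip=(), gelu_approximation=False):
    """Build keyword arguments that keep every fusion on except those in skip."""
    unknown = set(skip) - set(FUSION_FLAGS)
    if unknown:
        raise ValueError(f"unknown fusion name(s): {sorted(unknown)}")
    kwargs = {"enable_transformers_specific_optimizations": True}
    for name, flag in FUSION_FLAGS.items():
        # disable_*=True switches the fusion off; False leaves it enabled.
        kwargs[flag] = name in skip
    kwargs["enable_gelu_approximation"] = gelu_approximation
    return kwargs


kwargs = fusion_kwargs(skip={"attention"}, gelu_approximation=True)
for flag, value in kwargs.items():
    print(f"{flag}={value}")
```

With Optimum installed, the resulting dictionary should be usable as `OptimizationConfig(optimization_level=2, **kwargs)`.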

While [OptimizationConfig](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.OptimizationConfig) gives you full control over how the optimization is performed, it can be hard to know what to enable or disable. Instead, you can use [AutoOptimizationConfig](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.AutoOptimizationConfig), which provides four common optimization levels:

* O1: basic general optimizations.
* O2: basic and extended general optimizations, transformers-specific fusions.
* O3: same as O2 with GELU approximation.
* O4: same as O3 with mixed precision (fp16, GPU-only).

Example: Loading an O2 [OptimizationConfig](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/configuration#optimum.onnxruntime.OptimizationConfig)

```
>>> from optimum.onnxruntime import AutoOptimizationConfig
>>> optimization_config = AutoOptimizationConfig.O2()
```

You can also specify custom arguments that are not defined in the O2 configuration, for instance:

```
>>> from optimum.onnxruntime import AutoOptimizationConfig
>>> optimization_config = AutoOptimizationConfig.O2(disable_embed_layer_norm_fusion=False)
```

#### Optimization examples

The following end-to-end example shows how to optimize [distilbert-base-uncased-finetuned-sst-2-english](https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).

```
>>> from optimum.onnxruntime import (
...     AutoOptimizationConfig, ORTOptimizer, ORTModelForSequenceClassification
... )

>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_dir = "distilbert_optimized"

>>> # Load a PyTorch model and export it to the ONNX format
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

>>> # Create the optimizer
>>> optimizer = ORTOptimizer.from_pretrained(model)

>>> # Define the optimization strategy by creating the appropriate configuration
>>> optimization_config = AutoOptimizationConfig.O2()

>>> # Optimize the model
>>> optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)
```

The following end-to-end example shows how to optimize a Seq2Seq model, [sshleifer/distilbart-cnn-12-6](https://huggingface.co/sshleifer/distilbart-cnn-12-6).

```
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OptimizationConfig, ORTOptimizer, ORTModelForSeq2SeqLM

>>> model_id = "sshleifer/distilbart-cnn-12-6"
>>> save_dir = "distilbart_optimized"

>>> # Load a PyTorch model and export it to the ONNX format
>>> model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)

>>> # Create the optimizer
>>> optimizer = ORTOptimizer.from_pretrained(model)

>>> # Define the optimization strategy by creating the appropriate configuration
>>> optimization_config = OptimizationConfig(
...     optimization_level=2,
...     enable_transformers_specific_optimizations=True,
...     optimize_for_gpu=False,
... )

>>> # Optimize the model
>>> optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> optimized_model = ORTModelForSeq2SeqLM.from_pretrained(save_dir)
>>> tokens = tokenizer("This is a sample input", return_tensors="pt")
>>> outputs = optimized_model.generate(**tokens)
```

### Optimizing a model with Optimum CLI

The Optimum ONNX Runtime optimization tools can be used directly through the Optimum command-line interface:

```
optimum-cli onnxruntime optimize --help
usage: optimum-cli <command> [<args>] onnxruntime optimize [-h] --onnx_model ONNX_MODEL -o OUTPUT (-O1 | -O2 | -O3 | -O4 | -c CONFIG)

options:
  -h, --help            show this help message and exit
  -O1                   Basic general optimizations (see: https://boincai.com/docs/optimum/onnxruntime/usage_guides/optimization for more details).
  -O2                   Basic and extended general optimizations, transformers-specific fusions (see: https://boincai.com/docs/optimum/onnxruntime/usage_guides/optimization for more
                        details).
  -O3                   Same as O2 with Gelu approximation (see: https://boincai.com/docs/optimum/onnxruntime/usage_guides/optimization for more details).
  -O4                   Same as O3 with mixed precision (see: https://boincai.com/docs/optimum/onnxruntime/usage_guides/optimization for more details).
  -c CONFIG, --config CONFIG
                        `ORTConfig` file to use to optimize the model.

Required arguments:
  --onnx_model ONNX_MODEL
                        Path to the repository where the ONNX models to optimize are located.
  -o OUTPUT, --output OUTPUT
                        Path to the directory where to store generated ONNX model.
```

Optimizing an ONNX model can be done as follows:

```
optimum-cli onnxruntime optimize --onnx_model onnx_model_location/ -O1 -o optimized_model/
```

This optimizes all the ONNX files in `onnx_model_location` with the basic general optimizations.
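To see which files the command would pick up, and to compare them with what it wrote, both directories can be listed from Python. The following is a minimal sketch using only the standard library; `list_onnx_files` is a hypothetical helper, the directory names are the ones from the command above, and it assumes the ONNX files sit directly inside each directory.

```python
from pathlib import Path


def list_onnx_files(directory):
    """Return (file name, size in bytes) for each .onnx file directly in directory."""
    root = Path(directory)
    if not root.is_dir():
        # Nothing to list if the directory has not been created yet.
        return []
    return sorted((path.name, path.stat().st_size) for path in root.glob("*.onnx"))


for location in ("onnx_model_location", "optimized_model"):
    for name, size in list_onnx_files(location):
        print(f"{location}/{name}: {size / 1e6:.1f} MB")
```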

