How to apply graph optimization


🤗 Optimum provides an optimum.onnxruntime package that enables you to apply graph optimization to many models hosted on the 🤗 Hub using the ONNX Runtime model optimization tool.

Optimizing a model during the ONNX export

The ONNX model can be optimized directly during the ONNX export using the Optimum CLI, by passing the --optimize {O1,O2,O3,O4} argument, for example:

```shell
optimum-cli export onnx --model gpt2 --optimize O3 gpt2_onnx/
```

The optimization levels are:

  • O1: basic general optimizations.

  • O2: basic and extended general optimizations, transformers-specific fusions.

  • O3: same as O2 with GELU approximation.

  • O4: same as O3 with mixed precision (fp16, GPU-only, requires --device cuda).

Optimizing a model programmatically with ORTOptimizer

ONNX models can be optimized with the ORTOptimizer. The class can be initialized using the from_pretrained() method, which supports different checkpoint formats.

  1. Using an already initialized ORTModel class.


  2. Using a local ONNX model from a directory.


Optimization Configuration

The OptimizationConfig class allows you to specify how the optimization should be performed by the ORTOptimizer.

In the optimization configuration, there are 4 possible optimization levels:

  • optimization_level=0: to disable all optimizations

  • optimization_level=1: to enable basic optimizations such as constant folding or redundant node eliminations

  • optimization_level=2: to enable extended graph optimizations such as node fusions

  • optimization_level=99: to enable data layout optimizations

Choosing a level enables the optimizations of that level, as well as the optimizations of all preceding levels. More information is available in the ONNX Runtime graph optimizations documentation.

enable_transformers_specific_optimizations=True means that transformers-specific graph fusion and approximation are performed in addition to the ONNX Runtime optimizations described above. Here is a list of the possible optimizations you can enable:

  • Gelu fusion with disable_gelu_fusion=False,

  • Layer Normalization fusion with disable_layer_norm_fusion=False,

  • Attention fusion with disable_attention_fusion=False,

  • SkipLayerNormalization fusion with disable_skip_layer_norm_fusion=False,

  • Add Bias and SkipLayerNormalization fusion with disable_bias_skip_layer_norm_fusion=False,

  • Add Bias and Gelu / FastGelu fusion with disable_bias_gelu_fusion=False,

  • Gelu approximation with enable_gelu_approximation=True.

While OptimizationConfig gives you full control over how to perform the optimization, it can be hard to know what to enable or disable. Instead, you can use AutoOptimizationConfig, which provides four common optimization levels:

  • O1: basic general optimizations.

  • O2: basic and extended general optimizations, transformers-specific fusions.

  • O3: same as O2 with GELU approximation.

  • O4: same as O3 with mixed precision (fp16, GPU-only).

Example: Loading an O2 OptimizationConfig


You can also specify custom arguments that are not defined in the O2 configuration, for instance:


Optimization examples

Below you will find a simple end-to-end example of how to optimize distilbert-base-uncased-finetuned-sst-2-english.


Below you will find a simple end-to-end example of how to optimize the Seq2Seq model sshleifer/distilbart-cnn-12-6.


Optimizing a model with Optimum CLI

The Optimum ONNX Runtime optimization tools can be used directly through the Optimum command-line interface:


Optimizing an ONNX model can be done as follows:


This optimizes all the ONNX files in onnx_model_location with the basic general optimizations.
