How to apply graph optimization
🤗 Optimum provides an `optimum.onnxruntime` package that enables you to apply graph optimization to many models hosted on the 🤗 Hub using the ONNX Runtime model optimization tool.
Optimizing a model during the ONNX export
The ONNX model can be optimized directly during the ONNX export with the Optimum CLI, by passing the argument `--optimize {O1,O2,O3,O4}`, for example:
The optimization levels are:
- O1: basic general optimizations.
- O2: basic and extended general optimizations, transformers-specific fusions.
- O3: same as O2 with GELU approximation.
- O4: same as O3 with mixed precision (fp16, GPU-only, requires `--device cuda`).
Optimizing a model programmatically with ORTOptimizer
ONNX models can be optimized with the `ORTOptimizer`. The class can be initialized using the `from_pretrained()` method, which supports different checkpoint formats.
Using an already initialized ORTModel class.
Using a local ONNX model from a directory.
Optimization Configuration
The `OptimizationConfig` class allows you to specify how the optimization should be performed by the `ORTOptimizer`.
In the optimization configuration, there are four possible optimization levels:
- `optimization_level=0`: to disable all optimizations
- `optimization_level=1`: to enable basic optimizations such as constant folding or redundant node elimination
- `optimization_level=2`: to enable extended graph optimizations such as node fusions
- `optimization_level=99`: to enable data layout optimizations

Choosing a level enables the optimizations of that level, as well as the optimizations of all preceding levels. More information is available in the ONNX Runtime graph optimizations documentation.
Setting `enable_transformers_specific_optimizations=True` means that `transformers`-specific graph fusions and approximations are performed in addition to the ONNX Runtime optimizations described above. Here is the list of the possible optimizations you can enable:
- Gelu fusion with `disable_gelu_fusion=False`
- Layer Normalization fusion with `disable_layer_norm_fusion=False`
- Attention fusion with `disable_attention_fusion=False`
- SkipLayerNormalization fusion with `disable_skip_layer_norm_fusion=False`
- Add Bias and SkipLayerNormalization fusion with `disable_bias_skip_layer_norm_fusion=False`
- Add Bias and Gelu / FastGelu fusion with `disable_bias_gelu_fusion=False`
- Gelu approximation with `enable_gelu_approximation=True`
While `OptimizationConfig` gives you full control over how the optimization is done, it can be hard to know what to enable or disable. Instead, you can use `AutoOptimizationConfig`, which provides four common optimization levels:
- O1: basic general optimizations.
- O2: basic and extended general optimizations, transformers-specific fusions.
- O3: same as O2 with GELU approximation.
- O4: same as O3 with mixed precision (fp16, GPU-only).
Example: Loading an O2 `OptimizationConfig`
You can also specify custom arguments that are not defined in the O2 configuration, for instance:
Optimization examples
Below you will find an easy end-to-end example of how to optimize `distilbert-base-uncased-finetuned-sst-2-english`.
Below you will find an easy end-to-end example of how to optimize a Seq2Seq model, `sshleifer/distilbart-cnn-12-6`.
Optimizing a model with Optimum CLI
The Optimum ONNX Runtime optimization tools can be used directly through the Optimum command-line interface:
Optimizing an ONNX model can be done as follows:
This optimizes all the ONNX files in `onnx_model_location` with the basic general optimizations.