
How to apply graph optimization


Last updated 1 year ago

Optimization

🌍 Optimum provides an optimum.onnxruntime package that enables you to apply graph optimization to many models hosted on the 🌍 hub using the ONNX Runtime model optimization tool.

Optimizing a model during the ONNX export

The ONNX model can be optimized directly during the ONNX export with the Optimum CLI by passing the argument --optimize {O1,O2,O3,O4}, for example:


optimum-cli export onnx --model gpt2 --optimize O3 gpt2_onnx/

The optimization levels are:

  • O1: basic general optimizations.

  • O2: basic and extended general optimizations, transformers-specific fusions.

  • O3: same as O2 with GELU approximation.

  • O4: same as O3 with mixed precision (fp16, GPU-only, requires --device cuda).
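
The export command above can also be assembled from a script. The following is a minimal sketch, assuming optimum-cli is installed and on the PATH when the command is run. build_export_command is a hypothetical helper, not part of Optimum. It only reuses the --model, --optimize and --device arguments mentioned above and enforces the O4 constraint from the list.

```python
import subprocess

VALID_LEVELS = ("O1", "O2", "O3", "O4")


def build_export_command(model, output_dir, level, device=None):
    """Assemble an `optimum-cli export onnx` command with --optimize (hypothetical helper)."""
    if level not in VALID_LEVELS:
        raise ValueError(f"--optimize must be one of {VALID_LEVELS}, got {level!r}")
    if level == "O4" and device != "cuda":
        # O4 adds fp16 mixed precision, which is GPU-only.
        raise ValueError("O4 requires device='cuda'")
    command = ["optimum-cli", "export", "onnx", "--model", model, "--optimize", level]
    if device is not None:
        command += ["--device", device]
    command.append(output_dir)
    return command


command = build_export_command("gpt2", "gpt2_onnx/", "O3")
print(" ".join(command))
# To actually run the export (requires optimum to be installed):
# subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting issues when model names or paths contain spaces.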

Optimizing a model programmatically with ORTOptimizer

ONNX models can be optimized with the ORTOptimizer. The class can be initialized using the from_pretrained() method, which supports different checkpoint formats.

  1. Using an already initialized ORTModel class.


>>> from optimum.onnxruntime import ORTOptimizer, ORTModelForSequenceClassification

# Loading ONNX Model from the Hub
>>> model = ORTModelForSequenceClassification.from_pretrained(
...     "optimum/distilbert-base-uncased-finetuned-sst-2-english"
... )

# Create an optimizer from an ORTModelForXXX
>>> optimizer = ORTOptimizer.from_pretrained(model)

  2. Using a local ONNX model from a directory.


>>> from optimum.onnxruntime import ORTOptimizer

# This assumes a model.onnx exists in path/to/model
>>> optimizer = ORTOptimizer.from_pretrained("path/to/model")
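
Because loading from a directory assumes an ONNX file is already there, it can help to check for one before creating the optimizer. The following is a minimal sketch. find_onnx_files and load_optimizer are hypothetical helper names, and the optimum import is deferred so the directory check runs even where optimum is not installed.

```python
from pathlib import Path


def find_onnx_files(model_dir):
    """Return the sorted names of the .onnx files found directly in model_dir."""
    return sorted(p.name for p in Path(model_dir).glob("*.onnx"))


def load_optimizer(model_dir):
    """Create an ORTOptimizer from a local directory, failing early if no ONNX file is there."""
    onnx_files = find_onnx_files(model_dir)
    if not onnx_files:
        raise FileNotFoundError(f"No .onnx file found in {model_dir!r}")
    # Imported here so the check above works even without optimum installed.
    from optimum.onnxruntime import ORTOptimizer

    return ORTOptimizer.from_pretrained(model_dir)


# Usage (requires optimum and a directory that contains model.onnx):
# optimizer = load_optimizer("path/to/model")
```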

Optimization Configuration

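The OptimizationConfig class allows you to specify how the optimization should be performed by the ORTOptimizer.

In the optimization configuration, there are 4 possible optimization levels: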

  • optimization_level=0: to disable all optimizations

  • optimization_level=1: to enable basic optimizations such as constant folding or redundant node eliminations

  • optimization_level=2: to enable extended graph optimizations such as node fusions

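  • optimization_level=99: to enable data layout optimizations

Choosing a level enables the optimizations of that level, as well as the optimizations of all preceding levels. More information is available in the ONNX Runtime documentation on graph optimizations.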
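
The cumulative behaviour of the levels can be illustrated in plain Python. This is only an illustration, not library code: enabled_optimizations is a hypothetical function, and pairing each level with the onnxruntime.GraphOptimizationLevel member of the same numeric value is an assumption made here for orientation.

```python
# Values accepted by optimization_level, with the GraphOptimizationLevel member
# that has the same numeric value (assumed correspondence) and a short description.
LEVELS = {
    0: ("ORT_DISABLE_ALL", "no optimization"),
    1: ("ORT_ENABLE_BASIC", "basic: constant folding, redundant node elimination"),
    2: ("ORT_ENABLE_EXTENDED", "extended: node fusions"),
    99: ("ORT_ENABLE_ALL", "data layout optimizations"),
}


def enabled_optimizations(level):
    """List the optimization groups active at `level`, including all lower levels."""
    if level not in LEVELS:
        raise ValueError(f"optimization_level must be one of {sorted(LEVELS)}")
    return [desc for lvl, (_, desc) in sorted(LEVELS.items()) if 0 < lvl <= level]


for level in sorted(LEVELS):
    print(level, LEVELS[level][0], enabled_optimizations(level))
```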

enable_transformers_specific_optimizations=True means that transformers-specific graph fusion and approximation are performed in addition to the ONNX Runtime optimizations described above. Here is a list of the possible optimizations you can enable:

  • Gelu fusion with disable_gelu_fusion=False,

  • Layer Normalization fusion with disable_layer_norm_fusion=False,

  • Attention fusion with disable_attention_fusion=False,

  • SkipLayerNormalization fusion with disable_skip_layer_norm_fusion=False,

  • Add Bias and SkipLayerNormalization fusion with disable_bias_skip_layer_norm_fusion=False,

  • Add Bias and Gelu / FastGelu fusion with disable_bias_gelu_fusion=False,

  • Gelu approximation with enable_gelu_approximation=True.
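
The flags above are passed as keyword arguments to OptimizationConfig. The following is a minimal sketch of collecting them in one place. fusion_kwargs and build_config are hypothetical helper names, the flag names are the ones listed above, and optimization_level=2 is an arbitrary choice for the example.

```python
FUSION_FLAGS = (
    "disable_gelu_fusion",
    "disable_layer_norm_fusion",
    "disable_attention_fusion",
    "disable_skip_layer_norm_fusion",
    "disable_bias_skip_layer_norm_fusion",
    "disable_bias_gelu_fusion",
)


def fusion_kwargs(disabled=(), gelu_approximation=False):
    """Build OptimizationConfig keyword arguments for the fusions listed above."""
    unknown = set(disabled) - set(FUSION_FLAGS)
    if unknown:
        raise ValueError(f"Unknown fusion flags: {sorted(unknown)}")
    # A fusion stays enabled unless its disable_* flag is set to True.
    kwargs = {flag: flag in disabled for flag in FUSION_FLAGS}
    kwargs["enable_gelu_approximation"] = gelu_approximation
    kwargs["enable_transformers_specific_optimizations"] = True
    return kwargs


def build_config(**kwargs):
    """Create the OptimizationConfig (requires optimum with ONNX Runtime support)."""
    from optimum.onnxruntime import OptimizationConfig

    return OptimizationConfig(optimization_level=2, **kwargs)


kwargs = fusion_kwargs(disabled=["disable_attention_fusion"], gelu_approximation=True)
print(kwargs)
# optimization_config = build_config(**kwargs)
```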

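While OptimizationConfig gives you full control over how the optimization is done, it can be hard to know what to enable or disable. Instead, you can use AutoOptimizationConfig, which provides four common optimization levels:

  • O1: basic general optimizations.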

  • O2: basic and extended general optimizations, transformers-specific fusions.

  • O3: same as O2 with GELU approximation.

  • O4: same as O3 with mixed precision (fp16, GPU-only).

Example: Loading an O2 OptimizationConfig

>>> from optimum.onnxruntime import AutoOptimizationConfig
>>> optimization_config = AutoOptimizationConfig.O2()

You can also specify custom arguments that were not defined in the O2 configuration, for instance:


>>> from optimum.onnxruntime import AutoOptimizationConfig
>>> optimization_config = AutoOptimizationConfig.O2(disable_embed_layer_norm_fusion=False)

Optimization examples

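Below you will find an easy end-to-end example on how to optimize distilbert-base-uncased-finetuned-sst-2-english.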

>>> from optimum.onnxruntime import (
...     AutoOptimizationConfig, ORTOptimizer, ORTModelForSequenceClassification
... )

>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_dir = "distilbert_optimized"

>>> # Load a PyTorch model and export it to the ONNX format
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

>>> # Create the optimizer
>>> optimizer = ORTOptimizer.from_pretrained(model)

>>> # Define the optimization strategy by creating the appropriate configuration
>>> optimization_config = AutoOptimizationConfig.O2()

>>> # Optimize the model
>>> optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)
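
Graph optimizations may change floating-point results slightly, so it can be worth comparing the optimized model against the original on a sample input. The following is a minimal sketch. max_abs_diff and compare_logits are hypothetical helper names, and the sketch assumes both models return an output with a logits attribute, as sequence classification models do.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equally shaped nested lists."""
    if isinstance(a, (list, tuple)):
        if len(a) != len(b):
            raise ValueError("shape mismatch")
        return max((max_abs_diff(x, y) for x, y in zip(a, b)), default=0.0)
    return abs(a - b)


def compare_logits(original_model, optimized_model, tokenizer, text):
    """Run both models on the same text and return the largest logit difference."""
    inputs = tokenizer(text, return_tensors="pt")
    original = original_model(**inputs).logits.tolist()
    optimized = optimized_model(**inputs).logits.tolist()
    return max_abs_diff(original, optimized)


# Usage, continuing from the example above (requires optimum and transformers):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained(model_id)
# optimized_model = ORTModelForSequenceClassification.from_pretrained(save_dir)
# print(compare_logits(model, optimized_model, tokenizer, "This is a sample input"))
```

What counts as an acceptable difference depends on the task. Comparing predicted labels on a validation set is a stricter check than comparing a single input.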

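Below you will find an easy end-to-end example on how to optimize a Seq2Seq model, sshleifer/distilbart-cnn-12-6.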

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import OptimizationConfig, ORTOptimizer, ORTModelForSeq2SeqLM

>>> model_id = "sshleifer/distilbart-cnn-12-6"
>>> save_dir = "distilbart_optimized"

>>> # Load a PyTorch model and export it to the ONNX format
>>> model = ORTModelForSeq2SeqLM.from_pretrained(model_id, export=True)

>>> # Create the optimizer
>>> optimizer = ORTOptimizer.from_pretrained(model)

>>> # Define the optimization strategy by creating the appropriate configuration
>>> optimization_config = OptimizationConfig(
...     optimization_level=2,
...     enable_transformers_specific_optimizations=True,
...     optimize_for_gpu=False,
... )

>>> # Optimize the model
>>> optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> optimized_model = ORTModelForSeq2SeqLM.from_pretrained(save_dir)
>>> tokens = tokenizer("This is a sample input", return_tensors="pt")
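>>> outputs = optimized_model.generate(**tokens)
>>> # Decode the generated token ids back to text
>>> summary = tokenizer.batch_decode(outputs, skip_special_tokens=True)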

Optimizing a model with Optimum CLI

The Optimum ONNX Runtime optimization tools can be used directly through the Optimum command-line interface:


optimum-cli onnxruntime optimize --help
usage: optimum-cli <command> [<args>] onnxruntime optimize [-h] --onnx_model ONNX_MODEL -o OUTPUT (-O1 | -O2 | -O3 | -O4 | -c CONFIG)

options:
  -h, --help            show this help message and exit
  -O1                   Basic general optimizations (see: https://boincai.com/docs/optimum/onnxruntime/usage_guides/optimization for more details).
  -O2                   Basic and extended general optimizations, transformers-specific fusions (see: https://boincai.com/docs/optimum/onnxruntime/usage_guides/optimization for more
                        details).
  -O3                   Same as O2 with Gelu approximation (see: https://boincai.com/docs/optimum/onnxruntime/usage_guides/optimization for more details).
  -O4                   Same as O3 with mixed precision (see: https://boincai.com/docs/optimum/onnxruntime/usage_guides/optimization for more details).
  -c CONFIG, --config CONFIG
                        `ORTConfig` file to use to optimize the model.

Required arguments:
  --onnx_model ONNX_MODEL
                        Path to the repository where the ONNX models to optimize are located.
  -o OUTPUT, --output OUTPUT
                        Path to the directory where to store generated ONNX model.

Optimizing an ONNX model can be done as follows:


optimum-cli onnxruntime optimize --onnx_model onnx_model_location/ -O1 -o optimized_model/

This optimizes all the ONNX files in onnx_model_location with the basic general optimizations.
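
When the CLI is driven from a script, the usage line above implies that exactly one of the level flags or -c has to be given. The following is a minimal sketch of that rule. build_optimize_command is a hypothetical helper, not part of Optimum, it uses only the arguments shown in the help output, and it assumes optimum-cli is on the PATH when the command is run.

```python
import subprocess

LEVEL_FLAGS = ("O1", "O2", "O3", "O4")


def build_optimize_command(onnx_model_dir, output_dir, level=None, config=None):
    """Assemble an `optimum-cli onnxruntime optimize` command (hypothetical helper)."""
    # The usage line lists -O1..-O4 and -c as alternatives: pass exactly one.
    if (level is None) == (config is None):
        raise ValueError("Pass exactly one of `level` or `config`")
    command = ["optimum-cli", "onnxruntime", "optimize", "--onnx_model", onnx_model_dir]
    if level is not None:
        if level not in LEVEL_FLAGS:
            raise ValueError(f"level must be one of {LEVEL_FLAGS}, got {level!r}")
        command.append(f"-{level}")
    else:
        command += ["-c", config]
    command += ["-o", output_dir]
    return command


command = build_optimize_command("onnx_model_location/", "optimized_model/", level="O1")
print(" ".join(command))
# To actually run the optimization (requires optimum to be installed):
# subprocess.run(command, check=True)
```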

