Quantization

How to apply dynamic and static quantization

🌍 Optimum provides an optimum.onnxruntime package that enables you to apply quantization to many models hosted on the BOINC AI Hub using the ONNX Runtime quantization tool.

The quantization process is abstracted via the ORTConfig and the ORTQuantizer classes. The former lets you specify how quantization should be done, while the latter actually performs the quantization.

You can read the conceptual guide on quantization to learn about the main concepts you will be using when performing quantization with the ORTQuantizer.

Quantizing a model with Optimum’s CLI

The Optimum ONNX Runtime quantization tool can be used through the Optimum command-line interface:


optimum-cli onnxruntime quantize --help
usage: optimum-cli <command> [<args>] onnxruntime quantize [-h] --onnx_model ONNX_MODEL -o OUTPUT [--per_channel] (--arm64 | --avx2 | --avx512 | --avx512_vnni | --tensorrt | -c CONFIG)

options:
  -h, --help            show this help message and exit
  --arm64               Quantization for the ARM64 architecture.
  --avx2                Quantization with AVX-2 instructions.
  --avx512              Quantization with AVX-512 instructions.
  --avx512_vnni         Quantization with AVX-512 and VNNI instructions.
  --tensorrt            Quantization for NVIDIA TensorRT optimizer.
  -c CONFIG, --config CONFIG
                        `ORTConfig` file to use to optimize the model.

Required arguments:
  --onnx_model ONNX_MODEL
                        Path to the repository where the ONNX models to quantize are located.
  -o OUTPUT, --output OUTPUT
                        Path to the directory where to store generated ONNX model.

Optional arguments:
  --per_channel         Compute the quantization parameters on a per-channel basis.

Quantizing an ONNX model can be done as follows:

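For instance, assuming your ONNX files live under onnx_model_location and using quantized_model_location as a placeholder output directory:

optimum-cli onnxruntime quantize --onnx_model onnx_model_location --avx512 -o quantized_model_location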

This quantizes all the ONNX files in onnx_model_location with AVX-512 instructions.

Creating an ORTQuantizer

The ORTQuantizer class is used to quantize your ONNX model. The class can be initialized using the from_pretrained() method, which supports different checkpoint formats.

  1. Using an already initialized ORTModelForXXX class.

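A minimal sketch, assuming the already-exported checkpoint optimum/distilbert-base-uncased-finetuned-sst-2-english is available on the Hub:

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer

# Load an ONNX model that was already exported and pushed to the Hub
ort_model = ORTModelForSequenceClassification.from_pretrained(
    "optimum/distilbert-base-uncased-finetuned-sst-2-english"
)
# Create the quantizer directly from the loaded model
quantizer = ORTQuantizer.from_pretrained(ort_model)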

  2. Using a local ONNX model from a directory.

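A minimal sketch, where path/to/model is a placeholder for a local directory containing an ONNX model:

from optimum.onnxruntime import ORTQuantizer

# Create the quantizer from a local directory containing an ONNX model
# (pass file_name=... if the directory holds several ONNX files)
quantizer = ORTQuantizer.from_pretrained("path/to/model")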

Apply Dynamic Quantization

The ORTQuantizer class can be used to dynamically quantize your ONNX model. Below is an easy end-to-end example of how to dynamically quantize distilbert-base-uncased-finetuned-sst-2-english.

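A sketch of the full flow, assuming a recent Optimum version where from_pretrained accepts export=True; the save directory name is a placeholder:

from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# Export the transformers model to ONNX
onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
# Create the quantizer
quantizer = ORTQuantizer.from_pretrained(onnx_model)
# Dynamic quantization: weights are quantized ahead of time,
# activations are quantized on the fly at inference
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
# Quantize the model and save it
model_quantized_path = quantizer.quantize(
    save_dir="distilbert_dynamic_quantized",
    quantization_config=dqconfig,
)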

Static Quantization example

The ORTQuantizer class can also be used to statically quantize your ONNX model. Below is an easy end-to-end example of how to statically quantize distilbert-base-uncased-finetuned-sst-2-english.

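A sketch of the static flow under the same assumptions; the calibration dataset (glue/sst2, matching the model’s fine-tuning data), the sample count and the save directory are illustrative choices:

from functools import partial

from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoCalibrationConfig, AutoQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
onnx_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
quantizer = ORTQuantizer.from_pretrained(onnx_model)
# Static quantization: weights and activation ranges are both fixed ahead of time
qconfig = AutoQuantizationConfig.avx512(is_static=True, per_channel=False)

def preprocess_fn(examples, tokenizer):
    return tokenizer(examples["sentence"])

# Build a small calibration dataset to estimate activation ranges
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=50,
    dataset_split="train",
)
# Use min-max calibration
calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
# Perform the calibration step: compute the activation quantization ranges
ranges = quantizer.fit(
    dataset=calibration_dataset,
    calibration_config=calibration_config,
    operators_to_quantize=qconfig.operators_to_quantize,
)
# Apply static quantization and save the result
model_quantized_path = quantizer.quantize(
    save_dir="distilbert_static_quantized",
    calibration_tensors_range=ranges,
    quantization_config=qconfig,
)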

Quantize Seq2Seq models

The ORTQuantizer class currently doesn’t support multi-file models, like ORTModelForSeq2SeqLM. If you want to quantize a Seq2Seq model, you have to quantize each of the model’s components individually.

Currently, only dynamic quantization is supported for Seq2Seq models.

  1. Load the seq2seq model as an ORTModelForSeq2SeqLM.

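A sketch, assuming the already-exported checkpoint optimum/t5-small is available on the Hub:

from optimum.onnxruntime import ORTModelForSeq2SeqLM

model_id = "optimum/t5-small"
# Load a seq2seq model whose ONNX components were already exported
onnx_model = ORTModelForSeq2SeqLM.from_pretrained(model_id)
# Directory holding the encoder/decoder ONNX files
model_dir = onnx_model.model_save_dir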

  2. Define the quantizers for the encoder, the decoder and the decoder with past key values.

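One quantizer per ONNX file; the file names below are those a recent Optimum export produces and may differ for older versions:

from optimum.onnxruntime import ORTQuantizer

encoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="encoder_model.onnx")
decoder_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_model.onnx")
decoder_wp_quantizer = ORTQuantizer.from_pretrained(model_dir, file_name="decoder_with_past_model.onnx")
quantizers = [encoder_quantizer, decoder_quantizer, decoder_wp_quantizer]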

  3. Quantize all models.

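Apply the same dynamic configuration to every component; the output directory name is a placeholder:

from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Dynamic quantization, since static quantization is not supported for Seq2Seq models
dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

for quantizer in quantizers:
    quantizer.quantize(save_dir="t5_small_quantized", quantization_config=dqconfig)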
