Quick tour

Quickstart

At its core, 🌍 Optimum uses configuration objects to define parameters for optimization on different accelerators. These objects are then used to instantiate dedicated optimizers, quantizers, and pruners.

Before applying quantization or optimization, we first need to export our model to the ONNX format.

>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> from transformers import AutoTokenizer

>>> model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_directory = "tmp/onnx/"
>>> # Load a model from transformers and export it to ONNX
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
>>> # Save the onnx model and tokenizer
>>> ort_model.save_pretrained(save_directory)
>>> tokenizer.save_pretrained(save_directory)

Let’s see now how we can apply dynamic quantization with ONNX Runtime:

>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
>>> from optimum.onnxruntime import ORTQuantizer
>>> # Define the quantization methodology
>>> qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
>>> quantizer = ORTQuantizer.from_pretrained(ort_model)
>>> # Apply dynamic quantization on the model
>>> quantizer.quantize(save_dir=save_directory, quantization_config=qconfig)

In this example, we’ve quantized a model from the BOINC AI Hub, but it could also be a path to a local model directory. The result from applying the quantize() method is a model_quantized.onnx file that can be used to run inference. Here’s an example of how to load an ONNX Runtime model and generate predictions with it:
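
A minimal sketch of that loading step, assuming the quantized file and the tokenizer were saved to `tmp/onnx/` as in the earlier snippets. The helper name `load_quantized_classifier` is hypothetical, and the imports are deferred into the function so that nothing is loaded until it is called; check the `file_name` argument against your installed Optimum version.

```python
def load_quantized_classifier(save_directory="tmp/onnx/", file_name="model_quantized.onnx"):
    """Build a text-classification pipeline backed by the quantized ONNX model."""
    # Deferred imports: defining this helper does not require the libraries yet.
    from optimum.onnxruntime import ORTModelForSequenceClassification
    from transformers import AutoTokenizer, pipeline

    # file_name selects the quantized graph instead of the default exported file.
    model = ORTModelForSequenceClassification.from_pretrained(save_directory, file_name=file_name)
    tokenizer = AutoTokenizer.from_pretrained(save_directory)
    return pipeline("text-classification", model=model, tokenizer=tokenizer)
```

Calling `load_quantized_classifier()("I like burritos")` should then return a list holding one label/score dictionary.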

Similarly, you can apply static quantization by simply setting is_static to True when instantiating the QuantizationConfig object:
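
As a sketch, the only change from the dynamic case is the `is_static` flag. The wrapper name `make_static_qconfig` is hypothetical; it reuses the same `arm64` preset as the dynamic example and defers the import until it is called.

```python
def make_static_qconfig():
    """Return an arm64 quantization config with static quantization enabled."""
    # Deferred import: the library is only needed when the helper is called.
    from optimum.onnxruntime.configuration import AutoQuantizationConfig

    # Same preset as the dynamic example, with is_static switched to True.
    return AutoQuantizationConfig.arm64(is_static=True, per_channel=False)
```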

Static quantization relies on feeding batches of data through the model to estimate the activation quantization parameters ahead of inference time. To support this, 🌍 Optimum allows you to provide a calibration dataset. The calibration dataset can be a simple Dataset object from the 🌍 Datasets library, or any dataset that’s hosted on the BOINC AI Hub. For this example, we’ll pick the sst2 dataset that the model was originally trained on:
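
The calibration flow can be sketched roughly as follows. This assumes a `quantizer` built with `ORTQuantizer.from_pretrained`, a static `qconfig`, and the `tokenizer` loaded earlier; the function name `quantize_statically` and the choice of 50 calibration samples with a min-max calibration method are illustrative assumptions, and method signatures may differ between Optimum versions.

```python
from functools import partial


def preprocess_fn(example, tokenizer):
    """Tokenize the `sentence` column of sst2 examples."""
    return tokenizer(example["sentence"])


def quantize_statically(quantizer, qconfig, tokenizer, save_directory="tmp/onnx/", num_samples=50):
    """Calibrate on sst2 samples, then apply static quantization."""
    # Deferred import: the library is only needed when the helper is called.
    from optimum.onnxruntime.configuration import AutoCalibrationConfig

    # Draw a small calibration set from the training split of glue/sst2.
    calibration_dataset = quantizer.get_calibration_dataset(
        "glue",
        dataset_config_name="sst2",
        preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
        num_samples=num_samples,
        dataset_split="train",
    )
    # Min-max calibration: track the smallest and largest activation values.
    calibration_config = AutoCalibrationConfig.minmax(calibration_dataset)
    # Run the calibration data through the model to compute activation ranges.
    ranges = quantizer.fit(
        dataset=calibration_dataset,
        calibration_config=calibration_config,
        operators_to_quantize=qconfig.operators_to_quantize,
    )
    # Quantize using the ranges estimated during calibration.
    return quantizer.quantize(
        save_dir=save_directory,
        calibration_tensors_range=ranges,
        quantization_config=qconfig,
    )
```

A larger `num_samples` generally gives more representative activation ranges at the cost of a longer calibration pass.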

As a final example, let’s take a look at applying graph optimization techniques such as operator fusion and constant folding. As before, we load a configuration object, but this time by setting the optimization level instead of the quantization approach:
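
A minimal sketch of such a configuration object. The wrapper name `make_optimization_config` is hypothetical, and the default level of 99 (which, in ONNX Runtime's numbering, enables all available graph optimizations) is an assumption chosen for illustration; lower levels apply fewer transformations.

```python
def make_optimization_config(optimization_level=99):
    """Return an OptimizationConfig for the requested graph optimization level."""
    # Deferred import: the library is only needed when the helper is called.
    from optimum.onnxruntime.configuration import OptimizationConfig

    # Levels follow ONNX Runtime: 0 disables optimizations, 99 enables all of them.
    return OptimizationConfig(optimization_level=optimization_level)
```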

Next, we load an optimizer to apply these optimizations to our model:
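
This step might look like the sketch below. It assumes `ort_model` is the exported model from the start of the tour and `optimization_config` is an `OptimizationConfig` instance; the wrapper name `optimize_model` is hypothetical.

```python
def optimize_model(ort_model, optimization_config, save_directory="tmp/onnx/"):
    """Apply ONNX Runtime graph optimizations and save the result."""
    # Deferred import: the library is only needed when the helper is called.
    from optimum.onnxruntime import ORTOptimizer

    # Build the optimizer from the already exported ONNX Runtime model.
    optimizer = ORTOptimizer.from_pretrained(ort_model)
    # Write the optimized model files into save_directory.
    return optimizer.optimize(save_dir=save_directory, optimization_config=optimization_config)
```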

And that’s it - the model is now optimized and ready for inference! As you can see, the process is similar in each case:

  1. Define the optimization / quantization strategies via an OptimizationConfig / QuantizationConfig object

  2. Instantiate an ORTQuantizer or ORTOptimizer class

  3. Apply the quantize() or optimize() method

  4. Run inference

Check out the examples directory for more sophisticated usage.

Happy optimizing 🌍!
