Optimization
🤗 Optimum Intel provides an openvino package that enables you to apply a variety of model compression methods, such as quantization and pruning, to many models hosted on the 🤗 hub, using the NNCF framework.
Post-training static quantization introduces an additional calibration step, where data is fed through the network in order to compute the quantization parameters of the activations. Here is how to apply static quantization on a fine-tuned DistilBERT:
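A minimal sketch of such a workflow is shown below, assuming the distilbert-base-uncased-finetuned-sst-2-english checkpoint and a small glue/sst2 calibration split; depending on your optimum-intel version, the classes may need to be imported from optimum.intel.openvino instead, and argument names may differ slightly.

```python
from functools import partial

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.intel import OVQuantizer

# Assumed fine-tuned DistilBERT checkpoint (adapt to your own model)
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def preprocess_fn(examples, tokenizer):
    return tokenizer(examples["sentence"], padding=True, truncation=True, max_length=128)

quantizer = OVQuantizer.from_pretrained(model)
# Build a small calibration set used to compute the activation quantization parameters
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_fn, tokenizer=tokenizer),
    num_samples=100,
    dataset_split="train",
)
save_dir = "ptq_model"
# Apply static quantization and export the result to the OpenVINO IR
quantizer.quantize(calibration_dataset=calibration_dataset, save_directory=save_dir)
tokenizer.save_pretrained(save_dir)
```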
The quantize()
method applies post-training static quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.
Apart from optimizing a model after training, as with post-training quantization above, optimum.openvino
also provides optimization methods during training, namely Quantization-Aware Training (QAT) and Joint Pruning, Quantization and Distillation (JPQD).
QAT simulates the effects of quantization during training in order to alleviate its impact on the model's accuracy. It is recommended when post-training quantization results in high accuracy degradation. Here is an example of how to fine-tune DistilBERT on the sst-2 task while applying quantization-aware training (QAT).
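A sketch of such a QAT run with OVTrainer follows, assuming the glue/sst2 dataset and the distilbert-base-uncased-finetuned-sst-2-english checkpoint; the default OVConfig is used for the quantization settings, and exact arguments may vary across optimum-intel versions.

```python
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    default_data_collator,
)
from optimum.intel import OVConfig, OVTrainer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda examples: tokenizer(examples["sentence"], padding=True), batched=True)
metric = evaluate.load("glue", "sst2")

def compute_metrics(eval_pred):
    return metric.compute(predictions=np.argmax(eval_pred.predictions, axis=1), references=eval_pred.label_ids)

save_dir = "qat_model"
# The default OVConfig describes the quantization applied during training
ov_config = OVConfig()

trainer = OVTrainer(
    model=model,
    args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=True),
    train_dataset=dataset["train"].select(range(300)),
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    ov_config=ov_config,
    task="text-classification",
)
trainer.train()
trainer.evaluate()
# Export the quantized model to the OpenVINO IR and save it
trainer.save_model()
```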
Other than quantization, compression methods like pruning and distillation are commonly used to further improve task performance and efficiency. Structured pruning slims a model for lower computational demands, while distillation leverages the knowledge of a teacher, usually a larger model, to improve model predictions. Combining these methods with quantization can result in an optimized model with significant efficiency improvements while retaining good task accuracy. In optimum.openvino, OVTrainer provides the capability to jointly prune, quantize and distill a model during training. Following is an example of how to perform the optimization on BERT-base for the sst-2 task.
First, we create a config dictionary to specify the target algorithms. Since optimum.openvino relies on NNCF as its backend, the config format follows the NNCF specifications (see the NNCF documentation). In the example config below, we specify pruning and quantization in a list of compression algorithms with their hyperparameters. The pruning method is based on structured movement sparsity, whereas the quantization corresponds to QAT. With this configuration, the model under optimization will be initialized with pruning and quantization operators at the beginning of the training.
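An illustrative config along those lines is sketched below, combining movement sparsity (structured pruning) and quantization. The scope patterns target BERT modules and all hyperparameter values are placeholders; consult the NNCF documentation for the exact schema and defaults.

```python
# Illustrative JPQD configuration: movement (structured) pruning + quantization.
# Values and scope patterns are examples only; see the NNCF documentation for the full schema.
compression_config = [
    {
        # Structured pruning via movement sparsity
        "algorithm": "movement_sparsity",
        "params": {
            "warmup_start_epoch": 1,
            "warmup_end_epoch": 4,
            "importance_regularization_factor": 0.01,
            "enable_structured_masking": True,
        },
        "sparse_structure_by_scopes": [
            {"mode": "block", "sparse_factors": [32, 32], "target_scopes": "{re}.*BertAttention.*"},
            {"mode": "per_dim", "axis": 0, "target_scopes": "{re}.*BertIntermediate.*"},
            {"mode": "per_dim", "axis": 1, "target_scopes": "{re}.*BertOutput.*"},
        ],
        "ignored_scopes": ["{re}.*NNCFEmbedding", "{re}.*pooler.*", "{re}.*LayerNorm.*"],
    },
    {
        # Quantization applied during training (QAT)
        "algorithm": "quantization",
        "weights": {"mode": "symmetric"},
        "activations": {"mode": "symmetric"},
    },
]
```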
Known limitation: current structured pruning with movement sparsity only supports the BERT, Wav2Vec2 and Swin families of models. See the NNCF documentation for more information.

For more details on movement sparsity and how to configure it, see the NNCF documentation.

For more on the other algorithms available in NNCF, see the NNCF documentation.

Once we have the config ready, we can start developing the training pipeline like the snippet below. Since we are customizing joint compression with the config above, notice that OVConfig is initialized with the config dictionary (JSON parsing to a Python dictionary is skipped for brevity). As for distillation, users are required to load the teacher model; it is loaded just like a normal model with the transformers API. OVTrainingArguments extends transformers' TrainingArguments with distillation hyperparameters, i.e. the distillation weight and temperature, for ease of use. The snippet below shows how we load a teacher model and create training arguments with OVTrainingArguments. Subsequently, the teacher model, together with the instantiated OVConfig and OVTrainingArguments, is fed to OVTrainer. Voila! That is all we need; the rest of the pipeline is identical to native transformers training.
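A sketch of that pipeline is given below, reusing the compression_config from above. The checkpoint names (bert-base-uncased as student, textattack/bert-base-uncased-SST-2 as teacher) are placeholders, and the distillation_weight and distillation_temperature argument names may differ across optimum-intel versions.

```python
import evaluate
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, default_data_collator
from optimum.intel import OVConfig, OVTrainer, OVTrainingArguments

# Placeholder checkpoints: a BERT-base student and a teacher fine-tuned on sst-2
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
teacher_model = AutoModelForSequenceClassification.from_pretrained("textattack/bert-base-uncased-SST-2")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

dataset = load_dataset("glue", "sst2")
dataset = dataset.map(lambda examples: tokenizer(examples["sentence"], padding=True), batched=True)
metric = evaluate.load("glue", "sst2")

def compute_metrics(eval_pred):
    return metric.compute(predictions=np.argmax(eval_pred.predictions, axis=1), references=eval_pred.label_ids)

# Initialize OVConfig with the customized compression config from above
ov_config = OVConfig(compression=compression_config)

# OVTrainingArguments extends TrainingArguments with the distillation hyperparameters
args = OVTrainingArguments(
    "jpqd_model",
    num_train_epochs=1.0,
    do_train=True,
    do_eval=True,
    distillation_weight=0.9,
    distillation_temperature=3,
)

trainer = OVTrainer(
    model=model,
    teacher_model=teacher_model,
    args=args,
    train_dataset=dataset["train"].select(range(300)),
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
    data_collator=default_data_collator,
    ov_config=ov_config,
    task="text-classification",
)

# The rest is identical to native transformers training
trainer.train()
trainer.evaluate()
trainer.save_model()
```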
For complete JPQD scripts, please refer to the examples provided.
Quantization-Aware Training (QAT) and knowledge distillation can also be combined in order to optimize Stable Diffusion models while maintaining accuracy. For more details, take a look at the dedicated example.
After applying quantization to our model, we can then easily load it with our OVModelFor<Task>
classes and perform inference with OpenVINO Runtime using the Transformers API.
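For instance, a quantized sequence-classification model saved in one of the save directories above could be loaded and run as sketched below; swap in the OVModelFor<Task> class matching your task.

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

# Directory where the quantized model was exported (e.g. by OVQuantizer or OVTrainer)
model_dir = "ptq_model"
model = OVModelForSequenceClassification.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Run inference with OpenVINO Runtime through the familiar Transformers pipeline API
cls_pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(cls_pipe("He's a dreadful magician."))
```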