Optimization

🌍 Optimum Intel provides an openvino package that enables you to apply a variety of model compression methods, such as quantization and pruning, to many models hosted on the 🌍 hub using the NNCF framework.

Post-training optimization

Post-training static quantization introduces an additional calibration step in which data is fed through the network in order to compute the quantization parameters of the activations. Here is how to apply static quantization to a fine-tuned DistilBERT model:


from functools import partial
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from optimum.intel import OVConfig, OVQuantizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# The directory where the quantized model will be saved
save_dir = "ptq_model"

def preprocess_function(examples, tokenizer):
    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

# Load the default quantization configuration detailing the quantization we wish to apply
quantization_config = OVConfig()
# Instantiate our OVQuantizer using the desired configuration
quantizer = OVQuantizer.from_pretrained(model)
# Create the calibration dataset used to perform static quantization
calibration_dataset = quantizer.get_calibration_dataset(
    "glue",
    dataset_config_name="sst2",
    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
    num_samples=300,
    dataset_split="train",
)
# Apply static quantization and export the resulting quantized model to OpenVINO IR format
quantizer.quantize(
    quantization_config=quantization_config,
    calibration_dataset=calibration_dataset,
    save_directory=save_dir,
)
# Save the tokenizer
tokenizer.save_pretrained(save_dir)

The quantize() method applies post-training static quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.
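The exported model can then be reloaded with the corresponding OVModelFor<Task> class and used for inference with OpenVINO Runtime. Below is a minimal sketch that reuses the save_dir and tokenizer from the example above; the example sentence is arbitrary.

from optimum.intel import OVModelForSequenceClassification

# Load the quantized model exported to OpenVINO IR above
ov_model = OVModelForSequenceClassification.from_pretrained(save_dir)
# Run a quick inference check on a single example
inputs = tokenizer("This movie was surprisingly enjoyable!", return_tensors="pt")
logits = ov_model(**inputs).logits
predicted_label = ov_model.config.id2label[int(logits.argmax(-1))]
print(predicted_label)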

Training-time optimization

Apart from optimizing a model after training, as with the post-training quantization above, optimum.openvino also provides optimization methods that are applied during training, namely Quantization-Aware Training (QAT) and Joint Pruning, Quantization and Distillation (JPQD).

Quantization-Aware Training (QAT)

QAT simulates the effects of quantization during training in order to alleviate its impact on the model's accuracy. It is recommended when post-training quantization results in a large accuracy degradation. Here is an example of how to fine-tune DistilBERT on the sst-2 task while applying quantization-aware training (QAT).


  import evaluate
  import numpy as np
  from transformers import (
      AutoModelForSequenceClassification,
      AutoTokenizer,
      TrainingArguments,
      default_data_collator,
  )
  from datasets import load_dataset
- from transformers import Trainer
+ from optimum.intel import OVConfig, OVTrainer, OVModelForSequenceClassification

  model_id = "distilbert-base-uncased-finetuned-sst-2-english"
  model = AutoModelForSequenceClassification.from_pretrained(model_id)
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  # The directory where the quantized model will be saved
  save_dir = "qat_model"
  dataset = load_dataset("glue", "sst2")
  dataset = dataset.map(
      lambda examples: tokenizer(examples["sentence"], padding=True), batched=True
  )
  metric = evaluate.load("glue", "sst2")

  def compute_metrics(eval_preds):
      preds = np.argmax(eval_preds.predictions, axis=1)
      return metric.compute(predictions=preds, references=eval_preds.label_ids)

  # Load the default quantization configuration detailing the quantization we wish to apply
+ ov_config = OVConfig()

- trainer = Trainer(
+ trainer = OVTrainer(
      model=model,
      args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=True),
      train_dataset=dataset["train"].select(range(300)),
      eval_dataset=dataset["validation"],
      compute_metrics=compute_metrics,
      tokenizer=tokenizer,
      data_collator=default_data_collator,
+     ov_config=ov_config,
+     task="text-classification",
  )

  # Train the model while applying quantization
  train_result = trainer.train()
  metrics = trainer.evaluate()
  # Export the quantized model to OpenVINO IR format and save it
  trainer.save_model()

  # Load the resulting quantized model
- model = AutoModelForSequenceClassification.from_pretrained(save_dir)
+ model = OVModelForSequenceClassification.from_pretrained(save_dir)

Joint Pruning, Quantization and Distillation (JPQD)

Other than quantization, compression methods like pruning and distillation are common approaches to further improve task performance and efficiency. Structured pruning slims a model down for lower computational demands, while distillation leverages the knowledge of a teacher model, which is usually larger, to improve model prediction. Combining these methods with quantization can result in an optimized model with significant efficiency improvements while retaining good task accuracy. In optimum.openvino, OVTrainer provides the capability to jointly prune, quantize and distill a model during training. Below is an example of how to perform the optimization on BERT-base for the sst-2 task.

First, we create a config dictionary to specify the target algorithms. As optimum.openvino relies on NNCF as its backend, the config format follows the NNCF specification. In the example config below, we specify pruning and quantization as a list of compression algorithms with their hyperparameters. The pruning method closely resembles the work of Lagunas et al., 2021, Block Pruning For Faster Transformers, whereas the quantization refers to QAT. With this configuration, the model under optimization will be initialized with pruning and quantization operators at the beginning of the training.


compression_config = [
    {
        "compression":
        {
        "algorithm":  "movement_sparsity",
        "params": {
            "warmup_start_epoch":  1,
            "warmup_end_epoch":    4,
            "importance_regularization_factor":  0.01,
            "enable_structured_masking":  True
        },
        "sparse_structure_by_scopes": [
            {"mode":  "block",   "sparse_factors": [32, 32], "target_scopes": "{re}.*BertAttention.*"},
            {"mode":  "per_dim", "axis":  0,                 "target_scopes": "{re}.*BertIntermediate.*"},
            {"mode":  "per_dim", "axis":  1,                 "target_scopes": "{re}.*BertOutput.*"},
        ],
        "ignored_scopes": ["{re}.*NNCFEmbedding", "{re}.*pooler.*", "{re}.*LayerNorm.*"]
        }
    },
    {
        "algorithm": "quantization",
        "weights": {"mode": "symmetric"}
        "activations": { "mode": "symmetric"},
    }
]

Known limitation: current structured pruning with movement sparsity only supports the BERT, Wav2vec2 and Swin families of models.

Once we have the config ready, we can start developing the training pipeline like the snippet below. Since we are customizing joint compression with the config above, notice that OVConfig is initialized with the config dictionary (parsing the configuration from a JSON file into a Python dictionary was skipped here for brevity; a minimal sketch of that step follows). As for distillation, users are required to load the teacher model; it is loaded just like a normal model with the transformers API. OVTrainingArguments extends transformers' TrainingArguments with the distillation hyperparameters, i.e. the distillation weight and temperature, for ease of use. The snippet below shows how we load a teacher model and create training arguments with OVTrainingArguments. Subsequently, the teacher model, together with the instantiated OVConfig and OVTrainingArguments, is fed to OVTrainer. Voilà! That is all we need; the rest of the pipeline is identical to native transformers training.
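For completeness, the elided JSON-loading step might look like the following minimal sketch; the file name compression_config.json is hypothetical and the file simply needs to contain the configuration shown above in JSON syntax.

import json

# Load the NNCF compression configuration from a JSON file
# (hypothetical file name; the file holds the configuration shown above in JSON syntax)
with open("compression_config.json") as f:
    compression_config = json.load(f)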


- from transformers import Trainer, TrainingArguments
+ from optimum.intel import OVConfig, OVTrainer, OVTrainingArguments

  # Load teacher model
+ teacher_model = AutoModelForSequenceClassification.from_pretrained(teacher_model_or_path)

- ov_config = OVConfig()
+ ov_config = OVConfig(compression=compression_config)

  trainer = OVTrainer(
      model=model,
+     teacher_model=teacher_model,
-     args=TrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=True),
+     args=OVTrainingArguments(save_dir, num_train_epochs=1.0, do_train=True, do_eval=True, distillation_temperature=3, distillation_weight=0.9),
      train_dataset=dataset["train"].select(range(300)),
      eval_dataset=dataset["validation"],
      compute_metrics=compute_metrics,
      tokenizer=tokenizer,
      data_collator=default_data_collator,
+     ov_config=ov_config,
      task="text-classification",
  )

  # Train the model as usual; pruning, quantization and distillation are applied internally during training
  train_result = trainer.train()
  metrics = trainer.evaluate()
  # Export the quantized model to OpenVINO IR format and save it
  trainer.save_model()

More details on movement sparsity and how to configure it can be found in the NNCF documentation, which also describes the other compression algorithms available in NNCF.

For complete JPQD scripts, please refer to the examples provided in the Optimum Intel repository.

Quantization-Aware Training (QAT) and knowledge distillation can also be combined in order to optimize Stable Diffusion models while maintaining accuracy. For more details, take a look at the blog post on this topic.

Inference with Transformers pipeline

After applying quantization on our model, we can then easily load it with our OVModelFor<Task> classes and perform inference with OpenVINO Runtime using the Transformers pipelines.

from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

model_id = "helenai/distilbert-base-uncased-finetuned-sst-2-english-ov-int8"
ov_model = OVModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
cls_pipe = pipeline("text-classification", model=ov_model, tokenizer=tokenizer)
text = "He's a dreadful magician."
outputs = cls_pipe(text)

[{'label': 'NEGATIVE', 'score': 0.9840195178985596}]
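The loaded model can also be saved locally so it can be reused without downloading it again; OVModelForSequenceClassification exposes save_pretrained just like a regular Transformers model (the directory name below is arbitrary).

# Save the OpenVINO model and tokenizer locally for later reuse
ov_model.save_pretrained("local_ov_int8_model")
tokenizer.save_pretrained("local_ov_int8_model")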

