Quick tour

This quick tour is intended for developers who are ready to dive into the code and see examples of how to integrate Optimum into their model training and inference workflows.

Accelerated inference

OpenVINO

To load a model and run inference with OpenVINO Runtime, you can just replace your AutoModelForXxx class with the corresponding OVModelForXxx class. If you want to load a PyTorch checkpoint, set export=True to convert your model to the OpenVINO IR (Intermediate Representation).

- from transformers import AutoModelForSequenceClassification
+ from optimum.intel.openvino import OVModelForSequenceClassification
  from transformers import AutoTokenizer, pipeline

  # Download a tokenizer and model from the Hub and convert to OpenVINO format
  model_id = "distilbert-base-uncased-finetuned-sst-2-english"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)

  # Run inference!
  classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
  results = classifier("He's a dreadful magician.")
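
Converting the checkpoint on every run is unnecessary: the converted model can be saved and reloaded. Below is a minimal sketch assuming the same distilbert checkpoint as above; the save directory path is only a placeholder.

>>> from optimum.intel.openvino import OVModelForSequenceClassification
>>> from transformers import AutoTokenizer

>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_directory = "ov_distilbert_sst2/"  # placeholder path for this example

>>> # Convert the PyTorch checkpoint to the OpenVINO IR and save it with its tokenizer
>>> model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> model.save_pretrained(save_directory)
>>> tokenizer.save_pretrained(save_directory)

>>> # Later runs can load the already-converted model directly, skipping the export
>>> model = OVModelForSequenceClassification.from_pretrained(save_directory)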

You can find more examples in the documentation and in the examples.

ONNX Runtime

To accelerate inference with ONNX Runtime, Optimum uses configuration objects to define parameters for graph optimization and quantization. These objects are then used to instantiate dedicated optimizers and quantizers.

Before applying quantization or optimization, we first need to load our model. To load a model and run inference with ONNX Runtime, you can just replace the canonical Transformers AutoModelForXxx class with the corresponding ORTModelForXxx class. If you want to load from a PyTorch checkpoint, set export=True to export your model to the ONNX format.

>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> from transformers import AutoTokenizer

>>> model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_directory = "tmp/onnx/"

>>> # Load a model from transformers and export it to ONNX
>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)

>>> # Save the ONNX model and tokenizer
>>> ort_model.save_pretrained(save_directory)
>>> tokenizer.save_pretrained(save_directory)

Let's now see how to apply dynamic quantization with ONNX Runtime:

>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
>>> from optimum.onnxruntime import ORTQuantizer

>>> # Define the quantization methodology
>>> qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
>>> quantizer = ORTQuantizer.from_pretrained(ort_model)

>>> # Apply dynamic quantization on the model
>>> quantizer.quantize(save_dir=save_directory, quantization_config=qconfig)

In this example, we've quantized a model from the BOINC AI Hub; in the same manner, we can quantize a model hosted locally by providing the path to the directory containing the model weights. Applying the quantize() method produces a model_quantized.onnx file that can be used to run inference. Here's an example of how to load an ONNX Runtime model and generate predictions with it:

>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> from transformers import pipeline, AutoTokenizer

>>> model = ORTModelForSequenceClassification.from_pretrained(save_directory, file_name="model_quantized.onnx")
>>> tokenizer = AutoTokenizer.from_pretrained(save_directory)
>>> classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> results = classifier("I love burritos!")
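
Graph optimization follows the same configuration-object pattern as quantization. Below is a minimal sketch using ORTOptimizer with an OptimizationConfig; the optimization level and save directory are illustrative choices, and the optimized model is typically written as model_optimized.onnx in that directory.

>>> from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
>>> from optimum.onnxruntime.configuration import OptimizationConfig

>>> model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_directory = "tmp/onnx/"

>>> # Export the model to ONNX and define the graph optimization strategy
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)
>>> optimization_config = OptimizationConfig(optimization_level=2)

>>> # Apply graph optimization and save the optimized model
>>> optimizer = ORTOptimizer.from_pretrained(ort_model)
>>> optimizer.optimize(save_dir=save_directory, optimization_config=optimization_config)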

You can find more examples in the documentation and in the examples.

Accelerated training

Habana

To train transformers on Habana's Gaudi processors, Optimum provides a GaudiTrainer that is very similar to the Transformers Trainer. Here is a simple example:

- from transformers import Trainer, TrainingArguments
+ from optimum.habana import GaudiTrainer, GaudiTrainingArguments

  # Download a pretrained model from the Hub
  model = AutoModelForXxx.from_pretrained("bert-base-uncased")

  # Define the training arguments
- training_args = TrainingArguments(
+ training_args = GaudiTrainingArguments(
      output_dir="path/to/save/folder/",
+     use_habana=True,
+     use_lazy_mode=True,
+     gaudi_config_name="Habana/bert-base-uncased",
      ...
  )

  # Initialize the trainer
- trainer = Trainer(
+ trainer = GaudiTrainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
      ...
  )

  # Use Habana Gaudi processor for training!
  trainer.train()

You can find more examples in the documentation and in the examples.

ONNX Runtime

To train transformers with ONNX Runtime's acceleration features, Optimum provides an ORTTrainer that is very similar to the Transformers Trainer. Here is a simple example:

- from transformers import Trainer, TrainingArguments
+ from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

  # Download a pretrained model from the Hub
  model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

  # Define the training arguments
- training_args = TrainingArguments(
+ training_args = ORTTrainingArguments(
      output_dir="path/to/save/folder/",
      optim="adamw_ort_fused",
      ...
  )

  # Create an ONNX Runtime Trainer
- trainer = Trainer(
+ trainer = ORTTrainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
+     feature="text-classification", # The task used when exporting the model to ONNX
      ...
  )

  # Use ONNX Runtime for training!
  trainer.train()

You can find more examples in the documentation and in the examples.

Out of the box ONNX export

The Optimum library handles the ONNX export of Transformers and Diffusers models out of the box!

Exporting a model to ONNX is as simple as:

optimum-cli export onnx --model gpt2 gpt2_onnx/
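
The exported folder can then be loaded back with the matching ORTModelForXxx class, here ORTModelForCausalLM since gpt2 is a text-generation model. This is a minimal sketch: the prompt text is only illustrative, and it assumes the tokenizer files were exported alongside the model, as the CLI does for standard checkpoints.

>>> from optimum.onnxruntime import ORTModelForCausalLM
>>> from transformers import AutoTokenizer, pipeline

>>> # Load the ONNX model and tokenizer produced by the export command above
>>> model = ORTModelForCausalLM.from_pretrained("gpt2_onnx/")
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2_onnx/")

>>> generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
>>> results = generator("ONNX Runtime is")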

Check out the help for more options:

optimum-cli export onnx --help

Check out the documentation for more.

PyTorch's BetterTransformer support

BetterTransformer is a free-lunch PyTorch-native optimization that yields a 1.25x to 4x speedup on the inference of Transformer-based models. It has been marked as stable in PyTorch 1.13. We integrated BetterTransformer with the most-used models from the Transformers library, and using the integration is as simple as:

>>> from optimum.bettertransformer import BetterTransformer
>>> from transformers import AutoModelForSequenceClassification

>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
>>> model = BetterTransformer.transform(model)
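
The returned model is used exactly like the original one, for example in a pipeline. The sketch below also reverts to the canonical Transformers implementation before saving, assuming an Optimum version recent enough to provide BetterTransformer.reverse; the save path is a placeholder.

>>> from transformers import AutoTokenizer, pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
>>> classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> results = classifier("BetterTransformer makes inference faster!")

>>> # Revert to the canonical Transformers implementation before saving
>>> model = BetterTransformer.reverse(model)
>>> model.save_pretrained("a_local_folder/")  # placeholder path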

Check out the documentation for more details, and the blog post on PyTorch's Medium to find out more about the integration!

torch.fx integration

Optimum integrates with torch.fx, providing several graph transformations as one-liners. We aim at supporting better management of quantization through torch.fx, both for quantization-aware training (QAT) and post-training quantization (PTQ).

Check out the documentation and reference for more!

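As a quick illustration of what such a one-liner transformation looks like, here is a minimal sketch that traces a model with the symbolic tracer shipped in Transformers and applies the MergeLinears transformation from optimum.fx.optimization; the checkpoint and input names are assumptions chosen for the example.

>>> from transformers import AutoModelForSequenceClassification
>>> from transformers.utils.fx import symbolic_trace
>>> from optimum.fx.optimization import MergeLinears

>>> model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

>>> # Trace the model into a torch.fx GraphModule
>>> traced_model = symbolic_trace(model, input_names=["input_ids", "attention_mask", "token_type_ids"])

>>> # One-liner graph transformation: merge linear layers that share the same input
>>> transformed_model = MergeLinears()(traced_model)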
