Quick tour

This quick tour is intended for developers who are ready to dive into the code and see examples of how to integrate 🌍 Optimum into their model training and inference workflows.

Accelerated inference

OpenVINO

To load a model and run inference with OpenVINO Runtime, you can just replace your AutoModelForXxx class with the corresponding OVModelForXxx class. If you want to load a PyTorch checkpoint, set export=True to convert your model to the OpenVINO IR (Intermediate Representation).

- from transformers import AutoModelForSequenceClassification
+ from optimum.intel.openvino import OVModelForSequenceClassification
  from transformers import AutoTokenizer, pipeline

  # Download a tokenizer and model from the Hub and convert to OpenVINO format
  model_id = "distilbert-base-uncased-finetuned-sst-2-english"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)

  # Run inference!
  classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
  results = classifier("He's a dreadful magician.")

You can find more examples in the documentation and in the examples.

ONNX Runtime

To accelerate inference with ONNX Runtime, 🌍 Optimum uses configuration objects to define parameters for graph optimization and quantization. These objects are then used to instantiate dedicated optimizers and quantizers.

Before applying quantization or optimization, we first need to load our model. To load a model and run inference with ONNX Runtime, you can just replace the canonical Transformers AutoModelForXxx class with the corresponding ORTModelForXxx class. If you want to load from a PyTorch checkpoint, set export=True to export your model to the ONNX format.

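For example, here is a sketch that mirrors the OpenVINO example above (the checkpoint is the same illustrative one):

- from transformers import AutoModelForSequenceClassification
+ from optimum.onnxruntime import ORTModelForSequenceClassification
  from transformers import AutoTokenizer, pipeline

  # Download a tokenizer and model from the Hub and convert the model to ONNX
  model_id = "distilbert-base-uncased-finetuned-sst-2-english"
  tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

  # Run inference!
  classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
  results = classifier("He's a dreadful magician.")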

Let's now see how to apply dynamic quantization with ONNX Runtime:

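Below is a minimal sketch, assuming the ONNX Runtime model loaded in the previous step is bound to model; the AVX512-VNNI configuration and the save directory are illustrative choices:

  from optimum.onnxruntime import ORTQuantizer
  from optimum.onnxruntime.configuration import AutoQuantizationConfig

  # Define the quantization strategy: dynamic quantization for AVX512-VNNI CPUs
  dqconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

  # Create a quantizer from the ONNX model loaded above
  quantizer = ORTQuantizer.from_pretrained(model)

  # Apply dynamic quantization and save the quantized model
  quantizer.quantize(save_dir="path/to/output/model", quantization_config=dqconfig)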

In this example, we've quantized a model from the BOINC AI Hub; in the same manner, we can quantize a model hosted locally by providing the path to the directory containing the model weights. Applying the quantize() method produces a model_quantized.onnx file that can be used to run inference. Here's an example of how to load an ONNX Runtime model and generate predictions with it:

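A sketch, assuming the save directory used in the previous step; since quantization does not change the tokenizer, we reload it from the original checkpoint:

  from optimum.onnxruntime import ORTModelForSequenceClassification
  from transformers import AutoTokenizer, pipeline

  # Load the quantized ONNX model produced by the quantizer
  save_dir = "path/to/output/model"
  model = ORTModelForSequenceClassification.from_pretrained(save_dir, file_name="model_quantized.onnx")

  # The tokenizer is unchanged, so reload it from the original checkpoint
  tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

  # Run inference on the quantized model
  classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
  results = classifier("He's a dreadful magician.")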

You can find more examples in the documentation and in the examples.

Accelerated training

Habana

To train transformers on Habana's Gaudi processors, 🌍 Optimum provides a GaudiTrainer that is very similar to the 🌍 Transformers Trainer. Here is a simple example:

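A sketch in the same diff style as the examples above; model and train_dataset are assumed to be defined elsewhere, and the Gaudi-specific argument values shown are illustrative:

- from transformers import Trainer, TrainingArguments
+ from optimum.habana import GaudiTrainer, GaudiTrainingArguments

  # Define the training arguments
- training_args = TrainingArguments(
+ training_args = GaudiTrainingArguments(
+     use_habana=True,
+     use_lazy_mode=True,
+     gaudi_config_name="Habana/bert-base-uncased",  # illustrative Gaudi configuration
      output_dir="path/to/save/folder/",
  )

  # Initialize the trainer
- trainer = Trainer(
+ trainer = GaudiTrainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
  )

  # Train on Habana Gaudi processors!
  trainer.train()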

You can find more examples in the documentation and in the examples.

ONNX Runtime

To train transformers with ONNX Runtime's acceleration features, 🌍 Optimum provides an ORTTrainer that is very similar to the 🌍 Transformers Trainer. Here is a simple example:

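A sketch in the same diff style; model and train_dataset are assumed to be defined elsewhere, and the ORT-fused optimizer is an illustrative choice:

- from transformers import Trainer, TrainingArguments
+ from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

  # Define the training arguments
- training_args = TrainingArguments(
+ training_args = ORTTrainingArguments(
      output_dir="path/to/save/folder/",
+     optim="adamw_ort_fused",  # fused AdamW implemented by ONNX Runtime
  )

  # Create an ONNX Runtime trainer
- trainer = Trainer(
+ trainer = ORTTrainer(
      model=model,
      args=training_args,
      train_dataset=train_dataset,
  )

  # Use ONNX Runtime for training!
  trainer.train()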

You can find more examples in the documentation and in the examples.

Out of the box ONNX export

The Optimum library handles the ONNX export of Transformers and Diffusers models out of the box!

Exporting a model to ONNX is as simple as running a single optimum-cli command.

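For instance, with an illustrative checkpoint and output directory:

  optimum-cli export onnx --model gpt2 gpt2_onnx/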

Check out the help for more options:

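For example:

  optimum-cli export onnx --help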

Check out the documentation for more.

PyTorch's BetterTransformer support

BetterTransformer is a free-lunch, PyTorch-native optimization that delivers a 1.25x to 4x speedup for inference with Transformer-based models. It has been marked as stable in PyTorch 1.13. We integrated BetterTransformer with the most-used models from the 🌍 Transformers library, and using the integration is as simple as:

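A minimal sketch (the checkpoint is illustrative):

  from optimum.bettertransformer import BetterTransformer
  from transformers import AutoModelForSequenceClassification

  # Load a regular Transformers model
  model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")

  # Swap the model onto the BetterTransformer fast path
  model = BetterTransformer.transform(model)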

Check out the documentation for more details, and the blog post on PyTorch's Medium to find out more about the integration!

torch.fx integration

Optimum integrates with torch.fx, providing several graph transformations that can be applied as one-liners. We aim to support better management of quantization through torch.fx, both for quantization-aware training (QAT) and post-training quantization (PTQ).
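As a rough sketch of what a one-liner transformation looks like (the specific transformations, MergeLinears composed with ChangeTrueDivToMulByInverse, and the traced checkpoint are illustrative):

  from transformers import BertModel
  from transformers.utils.fx import symbolic_trace
  from optimum.fx.optimization import ChangeTrueDivToMulByInverse, MergeLinears, compose

  # Trace the model with torch.fx
  model = BertModel.from_pretrained("bert-base-uncased")
  traced = symbolic_trace(model, input_names=["input_ids", "attention_mask", "token_type_ids"])

  # Compose several graph transformations and apply them in a single call
  transformation = compose(MergeLinears(), ChangeTrueDivToMulByInverse())
  transformed_model = transformation(traced)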

Check out the documentation and reference for more!
