This quick tour is intended for developers who are ready to dive into the code and see examples of how to integrate 🤗 Optimum into their model training and inference workflows.
Accelerated inference
OpenVINO
To load a model and run inference with OpenVINO Runtime, you can just replace your AutoModelForXxx class with the corresponding OVModelForXxx class. If you want to load a PyTorch checkpoint, set export=True to convert your model to the OpenVINO IR (Intermediate Representation).
- from transformers import AutoModelForSequenceClassification
+ from optimum.intel.openvino import OVModelForSequenceClassification
from transformers import AutoTokenizer, pipeline
# Download a tokenizer and model from the Hub and convert to OpenVINO format
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
# Run inference!
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
results = classifier("He's a dreadful magician.")
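The conversion to the OpenVINO IR only needs to happen once: the resulting model can be saved and later reloaded without setting export=True. Here is a minimal sketch; the directory name is only an illustration:
# Save the converted model and tokenizer (the directory name is arbitrary)
model.save_pretrained("ov_distilbert_sst2/")
tokenizer.save_pretrained("ov_distilbert_sst2/")

# Reload the OpenVINO model later, no export needed this time
model = OVModelForSequenceClassification.from_pretrained("ov_distilbert_sst2/")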
ONNX Runtime
To accelerate inference with ONNX Runtime, 🤗 Optimum uses configuration objects to define parameters for graph optimization and quantization. These objects are then used to instantiate dedicated optimizers and quantizers.
Before applying quantization or optimization, we first need to load our model. To load a model and run inference with ONNX Runtime, you can just replace the canonical Transformers AutoModelForXxx class with the corresponding ORTModelForXxx class. If you want to load from a PyTorch checkpoint, set export=True to export your model to the ONNX format.
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> from transformers import AutoTokenizer
>>> model_checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_directory = "tmp/onnx/"
>>> # Load a model from transformers and export it to ONNX
>>> tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
>>> ort_model = ORTModelForSequenceClassification.from_pretrained(model_checkpoint, export=True)
>>> # Save the ONNX model and tokenizer
>>> ort_model.save_pretrained(save_directory)
>>> tokenizer.save_pretrained(save_directory)
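Alongside quantization, the configuration objects mentioned above also drive graph optimization. Here is a minimal sketch applying an ORTOptimizer to the exported model; the optimization level chosen is only an illustrative example:
>>> from optimum.onnxruntime import ORTOptimizer
>>> from optimum.onnxruntime.configuration import OptimizationConfig

>>> # Define the optimization strategy
>>> optimization_config = OptimizationConfig(optimization_level=2)
>>> optimizer = ORTOptimizer.from_pretrained(ort_model)

>>> # Apply graph optimization and save the resulting model
>>> optimizer.optimize(save_dir=save_directory, optimization_config=optimization_config)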
Let's see now how we can apply dynamic quantization with ONNX Runtime:
>>> from optimum.onnxruntime.configuration import AutoQuantizationConfig
>>> from optimum.onnxruntime import ORTQuantizer
>>> # Define the quantization methodology
>>> qconfig = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
>>> quantizer = ORTQuantizer.from_pretrained(ort_model)
>>> # Apply dynamic quantization on the model
>>> quantizer.quantize(save_dir=save_directory, quantization_config=qconfig)
In this example, we've quantized a model from the Hugging Face Hub; in the same manner, we can quantize a model hosted locally by providing the path to the directory containing the model weights. The result of applying the quantize() method is a model_quantized.onnx file that can be used to run inference. Here's an example of how to load an ONNX Runtime model and generate predictions with it:
>>> from optimum.onnxruntime import ORTModelForSequenceClassification
>>> from transformers import pipeline, AutoTokenizer
>>> model = ORTModelForSequenceClassification.from_pretrained(save_directory, file_name="model_quantized.onnx")
>>> tokenizer = AutoTokenizer.from_pretrained(save_directory)
>>> classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> results = classifier("I love burritos!")
Accelerated training
Habana
To train transformers on Habana's Gaudi processors, 🤗 Optimum provides a GaudiTrainer that is very similar to the 🤗 Transformers Trainer. Here is a simple example:
- from transformers import Trainer, TrainingArguments
+ from optimum.habana import GaudiTrainer, GaudiTrainingArguments
# Download a pretrained model from the Hub
model = AutoModelForXxx.from_pretrained("bert-base-uncased")
# Define the training arguments
- training_args = TrainingArguments(
+ training_args = GaudiTrainingArguments(
output_dir="path/to/save/folder/",
+ use_habana=True,
+ use_lazy_mode=True,
+ gaudi_config_name="Habana/bert-base-uncased",
...
)
# Initialize the trainer
- trainer = Trainer(
+ trainer = GaudiTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
...
)
# Use Habana Gaudi processor for training!
trainer.train()
ONNX Runtime
To train transformers with ONNX Runtime's acceleration features, 🤗 Optimum provides an ORTTrainer that is very similar to the 🤗 Transformers Trainer. Here is a simple example:
- from transformers import Trainer, TrainingArguments
+ from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments
# Download a pretrained model from the Hub
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# Define the training arguments
- training_args = TrainingArguments(
+ training_args = ORTTrainingArguments(
output_dir="path/to/save/folder/",
optim="adamw_ort_fused",
...
)
# Create an ONNX Runtime Trainer
- trainer = Trainer(
+ trainer = ORTTrainer(
model=model,
args=training_args,
train_dataset=train_dataset,
+ feature="text-classification", # The model type to export to ONNX
...
)
# Use ONNX Runtime for training!
trainer.train()
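For completeness, here is one way the train_dataset placeholder used above could be built; the dataset and preprocessing below are only illustrative:
from datasets import load_dataset
from transformers import AutoTokenizer

# Illustrative text-classification dataset and tokenization
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
dataset = load_dataset("imdb", split="train")

def preprocess(examples):
    return tokenizer(examples["text"], truncation=True, max_length=128)

train_dataset = dataset.map(preprocess, batched=True)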
Out of the box ONNX export
The Optimum library handles the ONNX export of Transformers and Diffusers models out of the box!
Exporting a model to ONNX is as simple as:
optimum-cli export onnx --model gpt2 gpt2_onnx/
Check out the help for more options:
optimum-cli export onnx --help
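The exported model can then be loaded back with the ORTModelForXxx classes shown earlier. A minimal sketch, assuming the gpt2_onnx/ directory produced by the command above:
>>> from optimum.onnxruntime import ORTModelForCausalLM
>>> from transformers import AutoTokenizer, pipeline

>>> # Load the ONNX model exported by the CLI
>>> model = ORTModelForCausalLM.from_pretrained("gpt2_onnx/")
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
>>> results = generator("ONNX Runtime is")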
PyTorch's BetterTransformer support
BetterTransformer is a free-lunch PyTorch-native optimization to gain x1.25 - x4 speedup on the inference of Transformer-based models. It has been marked as stable in PyTorch. We integrated BetterTransformer with the most-used models from the 🤗 Transformers library, and using the integration is as simple as:
>>> from optimum.bettertransformer import BetterTransformer
>>> from transformers import AutoModelForSequenceClassification
>>> model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
>>> model = BetterTransformer.transform(model)
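The transformed model is a drop-in replacement for the original one. Here is a minimal sketch of using it for inference and then reverting the transformation (for example before saving), assuming a recent Optimum version that provides BetterTransformer.reverse:
>>> from transformers import AutoTokenizer, pipeline

>>> # Run inference with the accelerated model
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
>>> classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
>>> results = classifier("BetterTransformer makes inference faster!")

>>> # Revert to the canonical Transformers modeling code
>>> model = BetterTransformer.reverse(model)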
torch.fx
Optimum integrates with torch.fx, providing as a one-liner several graph transformations. We aim at supporting a better management of quantization through torch.fx, both for quantization-aware training (QAT) and post-training quantization (PTQ).
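As an illustration of what such a one-liner can look like, here is a minimal sketch using transformations shipped in optimum.fx.optimization; the model and the particular transformations chosen are only examples:
from transformers import BertModel
from transformers.utils.fx import symbolic_trace
from optimum.fx.optimization import ChangeTrueDivToMulByInverse, MergeLinears, compose

# Trace the model into a torch.fx GraphModule
model = BertModel.from_pretrained("bert-base-uncased")
traced = symbolic_trace(model, input_names=["input_ids", "attention_mask", "token_type_ids"])

# Compose several graph transformations and apply them in a single call
transformation = compose(ChangeTrueDivToMulByInverse(), MergeLinears())
transformed_model = transformation(traced)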