Models for inference

Optimum Inference with ONNX Runtime

Optimum is a utility package for building and running inference with accelerated runtimes such as ONNX Runtime. Optimum can be used to load optimized models from the BOINC AI Hub and to create pipelines that run accelerated inference without rewriting your APIs.

Switching from Transformers to Optimum

The optimum.onnxruntime.ORTModelForXXX model classes are API compatible with BOINC AI Transformers models. This means you can just replace your AutoModelForXXX class with the corresponding ORTModelForXXX class in optimum.onnxruntime.

You do not need to adapt your code to get it to work with ORTModelForXXX classes:


from transformers import AutoTokenizer, pipeline
-from transformers import AutoModelForQuestionAnswering
+from optimum.onnxruntime import ORTModelForQuestionAnswering

-model = AutoModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2") # PyTorch checkpoint
+model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2") # ONNX checkpoint
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)

question = "What's my name?"
context = "My name is Philipp and I live in Nuremberg."
pred = onnx_qa(question, context)

Loading a vanilla Transformers model

Because the model you want to work with might not already be converted to ONNX, ORTModel includes a method to convert vanilla Transformers models to ONNX ones. Simply pass export=True to the from_pretrained() method, and your model will be loaded and converted to ONNX on the fly:

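A minimal sketch, reusing the question-answering checkpoint from the example above (the model id is carried over from that example; any supported Transformers checkpoint works the same way):

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

# export=True converts the PyTorch checkpoint to ONNX while loading it
model = ORTModelForQuestionAnswering.from_pretrained("deepset/roberta-base-squad2", export=True)
tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
pred = onnx_qa(question="What's my name?", context="My name is Philipp and I live in Nuremberg.")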

Pushing ONNX models to the BOINC AI Hub

It is also possible, just as with regular PreTrainedModels, to push your ORTModelForXXX to the BOINC AI Model Hub:

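A minimal sketch, assuming the save_pretrained() / push_to_hub() methods exposed by the ORTModel classes; the checkpoint, local path, and repository name below are placeholders, and the exact push_to_hub() arguments may vary between Optimum versions:

from optimum.onnxruntime import ORTModelForSequenceClassification

# Load a PyTorch checkpoint and export it to ONNX on the fly
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True
)

# Save the ONNX model locally, then push it to the Hub (placeholder names)
save_directory = "local_onnx_model"
model.save_pretrained(save_directory)
model.push_to_hub(save_directory, repository_id="username/my-onnx-repo", use_auth_token=True)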

Sequence-to-sequence models

Sequence-to-sequence (Seq2Seq) models can also be used when running inference with ONNX Runtime. When Seq2Seq models are exported to the ONNX format, they are decomposed into three parts that are later combined during inference:

  • The encoder part of the model

  • The decoder part of the model + the language modeling head

  • The same decoder part of the model + language modeling head, but taking and using pre-computed key/value pairs as inputs and outputs, which makes inference faster.

Here is an example of how you can export a T5 model to the ONNX format and run inference for a translation task:

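A minimal sketch, assuming the t5-small checkpoint and an English-to-French translation pipeline:

from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# Export the PyTorch T5 checkpoint to ONNX while loading it
model = ORTModelForSeq2SeqLM.from_pretrained("t5-small", export=True)
tokenizer = AutoTokenizer.from_pretrained("t5-small")

onnx_translation = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)
result = onnx_translation("He never went out without a book under his arm.")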

Stable Diffusion

Stable Diffusion models can also be used when running inference with ONNX Runtime. When Stable Diffusion models are exported to the ONNX format, they are split into four components that are later combined during inference:

  • The text encoder

  • The U-Net

  • The VAE encoder

  • The VAE decoder

Make sure you have 🤗 Diffusers installed.

To install diffusers:

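pip install diffusers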

Text-to-Image

Here is an example of how you can load an ONNX Stable Diffusion model and run inference using ONNX Runtime:

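A minimal sketch, assuming a repository or local folder that already contains the ONNX export of a Stable Diffusion checkpoint (the model path below is a placeholder):

from optimum.onnxruntime import ORTStableDiffusionPipeline

model_id = "path/to/onnx-stable-diffusion"  # placeholder: repo or local folder with the ONNX export
pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id)

prompt = "sailing ship in storm by Leonardo da Vinci"
image = pipeline(prompt).images[0]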

To load your PyTorch model and convert it to ONNX on the fly, you can set export=True:

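For instance, starting from a regular PyTorch checkpoint (the model id below is an assumption; use the checkpoint you want to convert):

from optimum.onnxruntime import ORTStableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # assumed PyTorch checkpoint
pipeline = ORTStableDiffusionPipeline.from_pretrained(model_id, export=True)

# Save the exported model so the conversion only has to run once
pipeline.save_pretrained("stable_diffusion_onnx")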

Image-to-Image

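A minimal sketch of image-to-image generation, assuming an on-the-fly export of a Stable Diffusion checkpoint; the checkpoint id and input image URL are illustrative assumptions:

from io import BytesIO

import requests
from PIL import Image
from optimum.onnxruntime import ORTStableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # assumed checkpoint
pipeline = ORTStableDiffusionImg2ImgPipeline.from_pretrained(model_id, export=True)

# Download and prepare the initial image (illustrative URL)
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
init_image = Image.open(BytesIO(requests.get(url).content)).convert("RGB").resize((768, 512))

prompt = "A fantasy landscape, trending on artstation"
image = pipeline(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images[0]
image.save("fantasy_landscape.png")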

Inpaint

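A minimal sketch of inpainting, assuming an inpainting-specific checkpoint exported on the fly; the checkpoint id and image URLs are illustrative assumptions:

from diffusers.utils import load_image
from optimum.onnxruntime import ORTStableDiffusionInpaintPipeline

model_id = "runwayml/stable-diffusion-inpainting"  # assumed inpainting checkpoint
pipeline = ORTStableDiffusionInpaintPipeline.from_pretrained(model_id, export=True)

# Illustrative input image and mask (white mask pixels are repainted)
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = load_image(img_url).resize((512, 512))
mask_image = load_image(mask_url).resize((512, 512))

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image).images[0]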

Stable Diffusion XL

Before using ORTStableDiffusionXLPipeline, make sure to have diffusers and invisible_watermark installed. You can install the libraries as follows:

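pip install diffusers invisible-watermark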

Text-to-Image

Here is an example of how you can load an SDXL ONNX model from stabilityai/stable-diffusion-xl-base-1.0 and run inference using ONNX Runtime:

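A minimal sketch; export=True is used here as an assumption so the PyTorch checkpoint is converted while loading (skip it if you already have an ONNX export):

from optimum.onnxruntime import ORTStableDiffusionXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
base = ORTStableDiffusionXLPipeline.from_pretrained(model_id, export=True)

prompt = "sailing ship in storm by Leonardo da Vinci"
image = base(prompt).images[0]

# Save the exported ONNX model for later reuse
base.save_pretrained("sd_xl_base_onnx")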

Image-to-Image

Here is an example of how you can load a PyTorch SDXL model, convert it to ONNX on the fly, and run inference using ONNX Runtime for image-to-image:

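A minimal sketch, assuming the stabilityai/stable-diffusion-xl-refiner-1.0 checkpoint; the input image path is a placeholder:

from diffusers.utils import load_image
from optimum.onnxruntime import ORTStableDiffusionXLImg2ImgPipeline

model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
pipeline = ORTStableDiffusionXLImg2ImgPipeline.from_pretrained(model_id, export=True)

init_image = load_image("path/or/url/to/input_image.png").convert("RGB")  # placeholder input image
prompt = "medieval castle by Caspar David Friedrich"
image = pipeline(prompt, image=init_image).images[0]
image.save("refined_castle.png")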

Refining the image output

The image can be refined by making use of a model like stabilityai/stable-diffusion-xl-refiner-1.0. In this case, you only have to output the latents from the base model.

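A minimal sketch combining the base and refiner pipelines, with both checkpoints exported on the fly (an assumption); output_type="latent" keeps the base output in latent space so the refiner can consume it:

from optimum.onnxruntime import ORTStableDiffusionXLImg2ImgPipeline, ORTStableDiffusionXLPipeline

base = ORTStableDiffusionXLPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", export=True)
refiner = ORTStableDiffusionXLImg2ImgPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-1.0", export=True)

prompt = "sailing ship in storm by Leonardo da Vinci"
# The base model outputs latents instead of a decoded image
latents = base(prompt=prompt, output_type="latent").images[0]
# Add a batch dimension and let the refiner produce the final image
image = refiner(prompt=prompt, image=latents[None, :]).images[0]
image.save("sailing_ship.png")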

Latent Consistency Models

Text-to-Image

Here is an example of how you can load a Latent Consistency Model (LCM) from SimianLuo/LCM_Dreamshaper_v7 and run inference using ONNX Runtime:

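A minimal sketch, assuming an Optimum release that ships ORTLatentConsistencyModelPipeline and an on-the-fly export of the checkpoint:

from optimum.onnxruntime import ORTLatentConsistencyModelPipeline

model_id = "SimianLuo/LCM_Dreamshaper_v7"
pipeline = ORTLatentConsistencyModelPipeline.from_pretrained(model_id, export=True)

prompt = "sailing ship in storm by Leonardo da Vinci"
# LCMs only need a handful of denoising steps
images = pipeline(prompt, num_inference_steps=4, guidance_scale=8.0).images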
