Models for inference
Optimum Inference with OpenVINO
Optimum Intel can be used to load optimized models from the BOINC AI Hub and create pipelines to run inference with OpenVINO Runtime without rewriting your APIs.
Switching from Transformers to Optimum
You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors (see the full list of supported devices). To do so, just replace the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. To load a Transformers model and convert it to the OpenVINO format on-the-fly, you can set `export=True` when loading your model.
Here is an example of how to perform inference with OpenVINO Runtime for a text classification task:
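A minimal sketch, assuming the distilbert-base-uncased-finetuned-sst-2-english checkpoint (any compatible sequence classification model from the Hub works):

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

# Illustrative checkpoint; replace with any sequence classification model
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# export=True converts the Transformers model to the OpenVINO format on-the-fly
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

cls_pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(cls_pipe("He's a dreadful magician."))
```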
See the reference documentation for more information about parameters, and examples for different tasks.
To easily save the resulting model, you can use the `save_pretrained()` method, which will save both the BIN and XML files describing the graph. It is useful to save the tokenizer to the same directory so that it can easily be loaded together with the model.
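For example, continuing from the snippet above (the directory name is arbitrary):

```python
# Saves openvino_model.xml / openvino_model.bin plus the configuration files
save_directory = "ov_distilbert_sst2"  # arbitrary local directory
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
```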
By default, `OVModelForXxx` classes support dynamic shapes, enabling inputs of any shape. To speed up inference, static shapes can be enabled by specifying the desired input shapes.
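A sketch, assuming a batch size of 1 and a sequence length of 128:

```python
# Fix the input shapes: batch size of 1 and sequence length of 128
batch_size, sequence_length = 1, 128
model.reshape(batch_size, sequence_length)
```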
Once the shapes are fixed with the `reshape()` method, inference can no longer be performed with an input of a different shape. When instantiating your pipeline, you can specify the maximum total input sequence length after tokenization so that shorter sequences are padded and longer sequences are truncated.
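One way to do this, reusing the model and tokenizer from above and assuming the sequence length was fixed to 128 (the pipeline forwards tokenizer keyword arguments given at instantiation):

```python
# Shorter sequences are padded and longer ones truncated to the fixed length of 128
cls_pipe = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    max_length=128,
    padding="max_length",
    truncation=True,
)
print(cls_pipe("He's a dreadful magician."))
```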
To run inference on an Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See the OpenVINO documentation about installing drivers for GPU inference.)
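Continuing from the previous example:

```python
# Move the model to an Intel GPU device (runs in FP16 precision by default)
model.to("gpu")
```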
By default, the model is compiled when instantiating an `OVModel`. If the model is then reshaped or moved to another device, it will need to be recompiled, which by default happens right before the first inference (thus inflating the latency of the first inference). To avoid this unnecessary compilation, you can disable the first compilation by setting `compile=False`. The model can then be compiled before the first inference with `model.compile()`.
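A sketch combining the options above (the checkpoint name and shapes are illustrative):

```python
from optimum.intel import OVModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
# Skip the first compilation since the model is reshaped and moved to GPU afterwards
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True, compile=False)
model.reshape(1, 128)
model.to("gpu")
# Compile explicitly so the first inference does not pay the compilation cost
model.compile()
```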
It is possible to pass an `ov_config` parameter to `from_pretrained()` with custom OpenVINO configuration values. This can be used, for example, to enable full-precision inference on devices where FP16 or BF16 inference precision is used by default.
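For example, assuming the OpenVINO INFERENCE_PRECISION_HINT property:

```python
# Force full precision (f32) on devices where FP16 / BF16 would be used by default
ov_config = {"INFERENCE_PRECISION_HINT": "f32"}
model = OVModelForSequenceClassification.from_pretrained(model_id, ov_config=ov_config)
```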
Optimum Intel leverages OpenVINO’s model caching to speed up model compilation. By default, a `model_cache` directory is created in the model’s directory in the BOINC AI Hub cache. To override this, use the `ov_config` parameter and set `CACHE_DIR` to a different value. To disable model caching, set `CACHE_DIR` to an empty string.
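For example (the cache directory path is arbitrary):

```python
# Store the compiled model cache in a custom directory
model = OVModelForSequenceClassification.from_pretrained(model_id, ov_config={"CACHE_DIR": "/tmp/ov_cache"})

# Disable model caching entirely
model = OVModelForSequenceClassification.from_pretrained(model_id, ov_config={"CACHE_DIR": ""})
```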
Sequence-to-sequence models
Sequence-to-sequence (Seq2Seq) models, which generate a new sequence from an input, can also be used when running inference with OpenVINO. When Seq2Seq models are exported to the OpenVINO IR, they are decomposed into two parts: the encoder and the “decoder” (which actually consists of the decoder with the language modeling head), which are later combined during inference. To speed up sequential decoding, a cache of pre-computed key/value hidden states is used by default. An additional model component is therefore exported: the “decoder” that takes the pre-computed key/values as one of its inputs. This specific export comes from the fact that during the first pass the decoder has no pre-computed key/value hidden states, while during the rest of the generation the past key/values are used to speed up sequential decoding. To disable this cache, set `use_cache=False` in the `from_pretrained()` method.
Here is an example of how you can export a T5 model to the OpenVINO IR on-the-fly and run inference for a translation task:
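A sketch, assuming the t5-small checkpoint (any T5 checkpoint works):

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSeq2SeqLM

model_id = "t5-small"  # illustrative checkpoint
# export=True converts the PyTorch model to the OpenVINO IR on-the-fly
model = OVModelForSeq2SeqLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

translation_pipe = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)
text = "He never went out without a book under his arm."
print(translation_pipe(text))
```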
Stable Diffusion
Stable Diffusion models can also be used when running inference with OpenVINO. When Stable Diffusion models are exported to the OpenVINO format, they are decomposed into four components that are later combined during inference:
The text encoder
The U-Net
The VAE encoder
The VAE decoder
Make sure you have 🌍 Diffusers installed. To install `diffusers`:
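```bash
pip install diffusers
```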
Text-to-Image
Here is an example of how you can load an OpenVINO Stable Diffusion model and run inference using OpenVINO Runtime:
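A sketch, assuming a checkpoint already stored in the OpenVINO format (the model ID, prompt, and output file name are illustrative):

```python
from optimum.intel import OVStableDiffusionPipeline

# Illustrative ID of a Stable Diffusion model already converted to the OpenVINO format
model_id = "echarlaix/stable-diffusion-v1-5-openvino"
pipeline = OVStableDiffusionPipeline.from_pretrained(model_id)

prompt = "sailing ship in storm by Rembrandt"
image = pipeline(prompt).images[0]
image.save("sailing_ship.png")
```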
To load your PyTorch model and convert it to OpenVINO on-the-fly, you can set `export=True`.
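For example (the checkpoint and output directory are illustrative):

```python
# Export a PyTorch Stable Diffusion checkpoint to OpenVINO on-the-fly
model_id = "runwayml/stable-diffusion-v1-5"
pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
# Optionally save the exported model so the conversion only happens once
pipeline.save_pretrained("openvino-sd-v1-5")
```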
To further speed up inference, the model can be statically reshaped:
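A sketch, continuing from the pipeline above and assuming 512x512 outputs with a single image per prompt:

```python
# Statically reshape the pipeline for a fixed batch size, image size and number of images
batch_size, num_images, height, width = 1, 1, 512, 512
pipeline.reshape(batch_size=batch_size, height=height, width=width, num_images_per_prompt=num_images)

prompt = "sailing ship in storm by Rembrandt"
image = pipeline(prompt, height=height, width=width, num_images_per_prompt=num_images).images[0]
```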
If you want to change any parameters such as the output height or width, you’ll need to statically reshape your model once again.
Text-to-Image with Textual Inversion
Here is an example of how you can load an OpenVINO Stable Diffusion model with pre-trained textual inversion embeddings and run inference using OpenVINO Runtime:
First, you can run the original pipeline without textual inversion:
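A sketch (the checkpoint, prompt, and output file name are illustrative):

```python
from optimum.intel import OVStableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # illustrative checkpoint
prompt = "A <cat-toy> back-pack"

pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True, compile=False)
pipeline.compile()
image = pipeline(prompt, num_inference_steps=50).images[0]
image.save("stable_diffusion_v1_5_without_textual_inversion.png")
```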
Then, you can load the sd-concepts-library/cat-toy textual inversion embedding and run the pipeline with the same prompt again:
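A sketch, assuming the pipeline exposes the `load_textual_inversion()` method from Diffusers:

```python
# Reload the pipeline and add the <cat-toy> concept before compiling
pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True, compile=False)
pipeline.load_textual_inversion("sd-concepts-library/cat-toy", "<cat-toy>")
pipeline.compile()
image = pipeline(prompt, num_inference_steps=50).images[0]
image.save("stable_diffusion_v1_5_with_textual_inversion.png")
```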
The left image shows the generation result of the original Stable Diffusion v1.5, while the right image shows the generation result of Stable Diffusion v1.5 with textual inversion.
Image-to-Image
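Here is a sketch of image-to-image generation with `OVStableDiffusionImg2ImgPipeline` (the checkpoint, input image URL, prompt, and file names are illustrative):

```python
import requests
from io import BytesIO
from PIL import Image
from optimum.intel import OVStableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # illustrative checkpoint
pipeline = OVStableDiffusionImg2ImgPipeline.from_pretrained(model_id, export=True)

# Download and prepare the initial image (any RGB image works)
url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
init_image = Image.open(BytesIO(requests.get(url).content)).convert("RGB").resize((768, 512))

prompt = "A fantasy landscape, trending on artstation"
image = pipeline(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images[0]
image.save("fantasy_landscape.png")
```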
Stable Diffusion XL
Before using `OVStableDiffusionXLPipeline`, make sure to have `diffusers` and `invisible_watermark` installed. You can install the libraries as follows:
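```bash
pip install diffusers invisible-watermark
```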
Text-to-Image
Here is an example of how you can load an SDXL OpenVINO model from stabilityai/stable-diffusion-xl-base-1.0 and run inference using OpenVINO Runtime:
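A sketch (the prompt and output file name are illustrative):

```python
from optimum.intel import OVStableDiffusionXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
# export=True converts the PyTorch checkpoint to the OpenVINO format on-the-fly
base = OVStableDiffusionXLPipeline.from_pretrained(model_id, export=True)

prompt = "train station by Caspar David Friedrich"
image = base(prompt).images[0]
image.save("train_station.png")
```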
Text-to-Image with Textual Inversion
Here is an example of how you can load an SDXL OpenVINO model from stabilityai/stable-diffusion-xl-base-1.0 with pre-trained textual inversion embeddings and run inference using OpenVINO Runtime:
First, you can run the original pipeline without textual inversion:
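A sketch (the prompt, number of inference steps, and output file name are illustrative):

```python
from optimum.intel import OVStableDiffusionXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
prompt = "charturnerv2, multiple views of the same character in the same outfit"  # illustrative prompt

base = OVStableDiffusionXLPipeline.from_pretrained(model_id, export=True, compile=False)
base.compile()
image = base(prompt, num_inference_steps=15).images[0]
image.save("sdxl_without_textual_inversion.png")
```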
Then, you can load the charturnerv2 textual inversion embedding and run the pipeline with the same prompt again:
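A sketch, assuming the charturnerv2 embedding has been downloaded locally as charturnerv2.pt and that the pipeline exposes `load_textual_inversion()`:

```python
# Reload the pipeline and add the charturnerv2 embedding before compiling
# (charturnerv2.pt is assumed to be a locally downloaded textual inversion embedding)
base = OVStableDiffusionXLPipeline.from_pretrained(model_id, export=True, compile=False)
base.load_textual_inversion("./charturnerv2.pt", "charturnerv2")
base.compile()
image = base(prompt, num_inference_steps=15).images[0]
image.save("sdxl_with_textual_inversion.png")
```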
Image-to-Image
Here is an example of how you can load a PyTorch SDXL model, convert it to OpenVINO on-the-fly and run inference using OpenVINO Runtime for image-to-image:
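A sketch, assuming the stabilityai/stable-diffusion-xl-refiner-1.0 checkpoint and an input image of your own (the image path, prompt, and file names are illustrative):

```python
from optimum.intel import OVStableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
pipeline = OVStableDiffusionXLImg2ImgPipeline.from_pretrained(model_id, export=True)

# load_image accepts a local path or a URL; replace with your own input image
init_image = load_image("path/or/url/to/your_image.png").convert("RGB")
prompt = "medieval castle by Caspar David Friedrich"
image = pipeline(prompt, image=init_image).images[0]
image.save("medieval_castle.png")

# Save the exported OpenVINO model for later reuse
pipeline.save_pretrained("openvino-sd-xl-refiner-1.0")
```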
Refining the image output
The image can be refined by making use of a model like stabilityai/stable-diffusion-xl-refiner-1.0. In this case, you only have to output the latents from the base model.
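A sketch, reusing the `base` SDXL pipeline from the text-to-image example above and assuming the latents can be passed between the two pipelines as shown:

```python
from optimum.intel import OVStableDiffusionXLImg2ImgPipeline

refiner = OVStableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", export=True
)

prompt = "train station by Caspar David Friedrich"
# Output the latents from the base model and feed them to the refiner
image = base(prompt=prompt, output_type="latent").images[0]
image = refiner(prompt=prompt, image=image[None, :]).images[0]
image.save("refined_train_station.png")
```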
Supported tasks
As shown in the table below, each task is associated with a class that enables your model to be loaded automatically.
| Task | Auto class |
|---|---|
| text-classification | `OVModelForSequenceClassification` |
| token-classification | `OVModelForTokenClassification` |
| question-answering | `OVModelForQuestionAnswering` |
| audio-classification | `OVModelForAudioClassification` |
| image-classification | `OVModelForImageClassification` |
| feature-extraction | `OVModelForFeatureExtraction` |
| fill-mask | `OVModelForMaskedLM` |
| text-generation | `OVModelForCausalLM` |
| text2text-generation | `OVModelForSeq2SeqLM` |
| text-to-image | `OVStableDiffusionPipeline` |
| text-to-image | `OVStableDiffusionXLPipeline` |
| image-to-image | `OVStableDiffusionImg2ImgPipeline` |
| image-to-image | `OVStableDiffusionXLImg2ImgPipeline` |
| inpaint | `OVStableDiffusionInpaintPipeline` |