Models for inference

Optimum Inference with OpenVINO

Optimum Intel can be used to load optimized models from the BOINC AI Hub and create pipelines to run inference with OpenVINO Runtime without rewriting your APIs.

Switching from Transformers to Optimum

You can now easily perform inference with OpenVINO Runtime on a variety of Intel processors (see the full list of supported devices). To do so, just replace the AutoModelForXxx class with the corresponding OVModelForXxx class. To load a Transformers model and convert it to the OpenVINO format on the fly, you can set export=True when loading your model.

Here is an example of how to perform inference with OpenVINO Runtime for a text classification task:

- from transformers import AutoModelForSequenceClassification
+ from optimum.intel import OVModelForSequenceClassification
  from transformers import AutoTokenizer, pipeline

  model_id = "distilbert-base-uncased-finetuned-sst-2-english"
- model = AutoModelForSequenceClassification.from_pretrained(model_id)
+ model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
  tokenizer = AutoTokenizer.from_pretrained(model_id)
  cls_pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
  outputs = cls_pipe("He's a dreadful magician.")

  [{'label': 'NEGATIVE', 'score': 0.9919503927230835}]

To easily save the resulting model, you can use the save_pretrained() method, which will save both the BIN and XML files describing the graph. It is also useful to save the tokenizer to the same directory, so that it can easily be loaded together with the model.

# Save the exported model
save_directory = "openvino_distilbert"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)
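
Since the model has been exported to the OpenVINO format, it can later be reloaded directly from this directory without passing export=True. A minimal sketch, reusing the save_directory defined above:

from optimum.intel import OVModelForSequenceClassification
from transformers import AutoTokenizer

# Reload the exported OpenVINO model and its tokenizer from the local directory
model = OVModelForSequenceClassification.from_pretrained(save_directory)
tokenizer = AutoTokenizer.from_pretrained(save_directory)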

By default, OVModelForXxx classes support dynamic shapes, enabling inputs of any shape. To speed up inference, static shapes can be enabled by specifying the desired input shapes.

# Fix the batch size to 1 and the sequence length to 9
model.reshape(1, 9)
# Compile the model before the first inference
model.compile()

Once the shapes are fixed with the reshape() method, inference can no longer be performed with an input of a different shape. When instantiating your pipeline, you can specify the maximum total input sequence length after tokenization so that shorter sequences are padded and longer sequences are truncated.

from datasets import load_dataset
from transformers import AutoTokenizer, pipeline
from evaluate import evaluator
from optimum.intel import OVModelForQuestionAnswering

model_id = "distilbert-base-cased-distilled-squad"
model = OVModelForQuestionAnswering.from_pretrained(model_id, export=True)
model.reshape(1, 384)
tokenizer = AutoTokenizer.from_pretrained(model_id)
eval_dataset = load_dataset("squad", split="validation").select(range(50))
task_evaluator = evaluator("question-answering")
qa_pipe = pipeline(
    "question-answering",
    model=model,
    tokenizer=tokenizer,
    max_seq_len=384,
    padding="max_length",
    truncation=True,
)
metric = task_evaluator.compute(model_or_pipeline=qa_pipe, data=eval_dataset, metric="squad")

To run inference on Intel integrated or discrete GPU, use .to("gpu"). On GPU, models run in FP16 precision by default (see the OpenVINO documentation about installing drivers for GPU inference).

# Static shapes speed up inference
model.reshape(1, 9)
model.to("gpu")
# Compile the model before the first inference
model.compile()

By default, the model is compiled when the OVModel is instantiated. If the model is then reshaped or moved to another device, it needs to be recompiled, which by default happens before the first inference (thus inflating the latency of the first inference). To avoid an unnecessary compilation, you can disable the first compilation by setting compile=False and compile the model explicitly before the first inference with model.compile().

from optimum.intel import OVModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"
# Load the model and disable the model compilation
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True, compile=False)
# Reshape to a static sequence length of 128
model.reshape(1,128)
# Compile the model before the first inference
model.compile()

It is possible to pass an ov_config parameter to from_pretrained() with custom OpenVINO configuration values. This can be used, for example, to enable full precision inference on devices where FP16 or BF16 inference precision is used by default.

model = OVModelForSequenceClassification.from_pretrained(model_id, ov_config={"INFERENCE_PRECISION_HINT":"f32"})

Optimum Intel leverages OpenVINO's model caching to speed up model compiling. By default, a model_cache directory is created in the model's directory in the BOINC AI Hub cache. To override this, use the ov_config parameter and set CACHE_DIR to a different value. To disable model caching, set CACHE_DIR to an empty string.

model = OVModelForSequenceClassification.from_pretrained(model_id, ov_config={"CACHE_DIR":""})

Sequence-to-sequence models

Sequence-to-sequence (Seq2Seq) models, which generate a new sequence from an input, can also be used when running inference with OpenVINO. When Seq2Seq models are exported to the OpenVINO IR, they are decomposed into two parts that are later combined during inference: the encoder and the "decoder" (which actually consists of the decoder with the language modeling head). To speed up sequential decoding, a cache with pre-computed key/values hidden states is used by default. An additional model component is therefore exported: the "decoder" with pre-computed key/values as one of its inputs. This specific export comes from the fact that during the first pass the decoder has no pre-computed key/values hidden states, while during the rest of the generation past key/values are used to speed up sequential decoding. To disable this cache, set use_cache=False in the from_pretrained() method.
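
For example, a minimal sketch of loading a Seq2Seq model with the cache disabled (using the same t5-small checkpoint as in the example below):

from optimum.intel import OVModelForSeq2SeqLM

# Export without the pre-computed key/values cache (slower sequential decoding)
model = OVModelForSeq2SeqLM.from_pretrained("t5-small", export=True, use_cache=False)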

Here is an example of how you can run inference for a translation task using a T5 model and then save the exported OpenVINO model:

from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSeq2SeqLM

model_id = "t5-small"
model = OVModelForSeq2SeqLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
translation_pipe = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)
text = "He never went out without a book under his arm, and he often came back with two."
result = translation_pipe(text)

# Save the exported model
save_directory = "openvino_t5"
model.save_pretrained(save_directory)
tokenizer.save_pretrained(save_directory)

[{'translation_text': "Il n'est jamais sorti sans un livre sous son bras, et il est souvent revenu avec deux."}]

Stable Diffusion

Stable Diffusion models can also be used when running inference with OpenVINO. When Stable Diffusion models are exported to the OpenVINO format, they are decomposed into four components that are later combined during inference:

  • The text encoder

  • The U-NET

  • The VAE encoder

  • The VAE decoder

Make sure you have 🌍 Diffusers installed.

To install diffusers:

pip install optimum[diffusers]

Text-to-Image

Here is an example of how you can load an OpenVINO Stable Diffusion model and run inference using OpenVINO Runtime:

from optimum.intel import OVStableDiffusionPipeline

model_id = "echarlaix/stable-diffusion-v1-5-openvino"
pipeline = OVStableDiffusionPipeline.from_pretrained(model_id)
prompt = "sailing ship in storm by Rembrandt"
images = pipeline(prompt).images

To load your PyTorch model and convert it to OpenVINO on-the-fly, you can set export=True.

model_id = "runwayml/stable-diffusion-v1-5"
pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
# Don't forget to save the exported model
pipeline.save_pretrained("openvino-sd-v1-5")

To further speed up inference, the model can be statically reshaped:

# Define the shapes related to the inputs and desired outputs
batch_size = 1
num_images_per_prompt = 1
height = 512
width = 512

# Statically reshape the model
pipeline.reshape(batch_size=batch_size, height=height, width=width, num_images_per_prompt=num_images_per_prompt)
# Compile the model before the first inference
pipeline.compile()

# Run inference
images = pipeline(prompt, height=height, width=width, num_images_per_prompt=num_images_per_prompt).images

In case you want to change any parameters such as the output height or width, you'll need to statically reshape your model once again.
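
For instance, a minimal sketch of reshaping the same pipeline for 768x768 outputs and recompiling it before running inference again:

# Reshape the pipeline for a different output resolution, then recompile
pipeline.reshape(batch_size=1, height=768, width=768, num_images_per_prompt=1)
pipeline.compile()
images = pipeline(prompt, height=768, width=768, num_images_per_prompt=1).images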

Text-to-Image with Textual Inversion

Here is an example of how you can load an OpenVINO Stable Diffusion model with pre-trained textual inversion embeddings and run inference using OpenVINO Runtime:

First, you can run the original pipeline without textual inversion:

from optimum.intel import OVStableDiffusionPipeline
import numpy as np

model_id = "echarlaix/stable-diffusion-v1-5-openvino"
prompt = "A <cat-toy> back-pack"
# Set a random seed for better comparison
np.random.seed(42)

pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=False, compile=False)
pipeline.compile()
image1 = pipeline(prompt, num_inference_steps=50).images[0]
image1.save("stable_diffusion_v1_5_without_textual_inversion.png")

Then, you can load the sd-concepts-library/cat-toy textual inversion embedding and run the pipeline with the same prompt again:

# Reset stable diffusion pipeline
pipeline.clear_requests()

# Load textual inversion into stable diffusion pipeline
pipeline.load_textual_inversion("sd-concepts-library/cat-toy", "<cat-toy>")

# Compile the model before the first inference
pipeline.compile()
image2 = pipeline(prompt, num_inference_steps=50).images[0]
image2.save("stable_diffusion_v1_5_with_textual_inversion.png")

The left image shows the generation result of the original Stable Diffusion v1.5, and the right image shows the generation result of Stable Diffusion v1.5 with textual inversion.

Image-to-Image

Here is an example of how you can load a PyTorch Stable Diffusion model, convert it to OpenVINO on the fly, and run inference using OpenVINO Runtime for image-to-image:
import requests
from PIL import Image
from io import BytesIO
from optimum.intel import OVStableDiffusionImg2ImgPipeline

model_id = "runwayml/stable-diffusion-v1-5"
pipeline = OVStableDiffusionImg2ImgPipeline.from_pretrained(model_id, export=True)

url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
response = requests.get(url)
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))
prompt = "A fantasy landscape, trending on artstation"
image = pipeline(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images[0]
image.save("fantasy_landscape.png")

Stable Diffusion XL

Before using OVStableDiffusionXLPipeline, make sure you have diffusers and invisible_watermark installed. You can install the libraries as follows:

pip install diffusers
pip install "invisible-watermark>=0.2.0"

Text-to-Image

Here is an example of how you can load an SDXL OpenVINO model from stabilityai/stable-diffusion-xl-base-1.0 and run inference using OpenVINO Runtime:

from optimum.intel import OVStableDiffusionXLPipeline

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
base = OVStableDiffusionXLPipeline.from_pretrained(model_id)
prompt = "train station by Caspar David Friedrich"
image = base(prompt).images[0]
image.save("train_station.png")

Text-to-Image with Textual Inversion

Here is an example of how you can load an SDXL OpenVINO model from stabilityai/stable-diffusion-xl-base-1.0 with pre-trained textual inversion embeddings and run inference using OpenVINO Runtime. First, you can run the original pipeline without textual inversion:

from optimum.intel import OVStableDiffusionXLPipeline
import numpy as np

model_id = "stabilityai/stable-diffusion-xl-base-1.0"
prompt = "charturnerv2, multiple views of the same character in the same outfit, a character turnaround of a beautiful woman wearing a red jacket and black shirt, best quality, intricate details."
# Set a random seed for better comparison
np.random.seed(112)

base = OVStableDiffusionXLPipeline.from_pretrained(model_id, export=False, compile=False)
base.compile()
image1 = base(prompt, num_inference_steps=50).images[0]
image1.save("sdxl_without_textual_inversion.png")

Then, you can load the charturnerv2 textual inversion embedding and run the pipeline with the same prompt again:

# Reset stable diffusion pipeline
base.clear_requests()

# Load textual inversion into stable diffusion pipeline
base.load_textual_inversion("./charturnerv2.pt", "charturnerv2")

# Compile the model before the first inference
base.compile()
image2 = base(prompt, num_inference_steps=50).images[0]
image2.save("sdxl_with_textual_inversion.png")

The left image shows the generation result of the original SDXL base 1.0, and the right image shows the generation result of SDXL base 1.0 with textual inversion.

Image-to-Image

Here is an example of how you can load a PyTorch SDXL model, convert it to OpenVINO on-the-fly and run inference using OpenVINO Runtime for image-to-image:

from optimum.intel import OVStableDiffusionXLImg2ImgPipeline
from diffusers.utils import load_image

model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
pipeline = OVStableDiffusionXLImg2ImgPipeline.from_pretrained(model_id, export=True)

url = "https://huggingface.co/datasets/optimum/documentation-images/resolve/main/intel/openvino/sd_xl/castle_friedrich.png"
image = load_image(url).convert("RGB")
prompt = "medieval castle by Caspar David Friedrich"
image = pipeline(prompt, image=image).images[0]
# Don't forget to save your OpenVINO model so that you can load it without exporting it with `export=True`
pipeline.save_pretrained("openvino-sd-xl-refiner-1.0")

Refining the image output

The image can be refined by making use of a model like stabilityai/stable-diffusion-xl-refiner-1.0. In this case, you only have to output the latents from the base model.

from optimum.intel import OVStableDiffusionXLImg2ImgPipeline

model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
refiner = OVStableDiffusionXLImg2ImgPipeline.from_pretrained(model_id, export=True)

# `base` and `prompt` are taken from the Text-to-Image example above
image = base(prompt=prompt, output_type="latent").images[0]
image = refiner(prompt=prompt, image=image[None, :]).images[0]

Supported tasks

As shown in the table below, each task is associated with a class that enables your model to be loaded automatically.

Task                  | Auto Class
text-classification   | OVModelForSequenceClassification
token-classification  | OVModelForTokenClassification
question-answering    | OVModelForQuestionAnswering
audio-classification  | OVModelForAudioClassification
image-classification  | OVModelForImageClassification
feature-extraction    | OVModelForFeatureExtraction
fill-mask             | OVModelForMaskedLM
text-generation       | OVModelForCausalLM
text2text-generation  | OVModelForSeq2SeqLM
text-to-image         | OVStableDiffusionPipeline
text-to-image         | OVStableDiffusionXLPipeline
image-to-image        | OVStableDiffusionImg2ImgPipeline
image-to-image        | OVStableDiffusionXLImg2ImgPipeline
inpaint               | OVStableDiffusionInpaintPipeline
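
Each class follows the same loading pattern shown throughout this guide. As a minimal sketch, text generation with OVModelForCausalLM could look like this (the gpt2 checkpoint is an assumption used only for illustration):

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

model_id = "gpt2"  # assumption: any causal LM checkpoint from the Hub works the same way
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
gen_pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(gen_pipe("OpenVINO Runtime makes inference")[0]["generated_text"])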

See the reference documentation for more information about parameters, and examples for different tasks.
