Inference pipelines

Inference pipelines with the ONNX Runtime accelerator

The pipeline() function makes it simple to use models from the Model Hub for accelerated inference on a variety of tasks such as text classification, question answering and image classification.

You can also use the pipeline() function from Transformers and provide your Optimum model class.

Currently the supported tasks are:

  • feature-extraction
  • text-classification
  • token-classification
  • question-answering
  • zero-shot-classification
  • text-generation
  • text2text-generation
  • summarization
  • translation
  • image-classification
  • automatic-speech-recognition
  • image-to-text
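
Any of these task names can be passed directly to pipeline(). As a minimal sketch, swapping in token-classification (used here purely as an illustration; the default model for the task is downloaded and exported automatically on first use):

>>> from optimum.pipelines import pipeline

>>> # Any task from the list above works the same way; token-classification here.
>>> ner = pipeline(task="token-classification", accelerator="ort")
>>> ner("My name is Philipp and I live in Nuremberg.")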

Optimum pipeline usage

While each task has an associated pipeline class, it is simpler to use the general pipeline() function which wraps all the task-specific pipelines in one object. The pipeline() function automatically loads a default model and tokenizer/feature-extractor capable of performing inference for your task.

  1. Start by creating a pipeline by specifying an inference task:

>>> from optimum.pipelines import pipeline

>>> classifier = pipeline(task="text-classification", accelerator="ort")
  2. Pass your input text/image to the pipeline() function:

>>> classifier("I like you. I love you.")
[{'label': 'POSITIVE', 'score': 0.9998838901519775}]

Note: The default models used in the pipeline() function are not optimized for inference or quantized, so there won’t be a performance improvement compared to their PyTorch counterparts.

Using a vanilla Transformers model and converting to ONNX

The pipeline() function accepts any supported model from the BOINC AI Hub. There are tags on the Model Hub that allow you to filter for a model you'd like to use for your task.

To be able to load the model with the ONNX Runtime backend, the export to ONNX needs to be supported for the considered architecture. You can check the list of supported architectures in the ONNX exporter overview.

Once you have picked an appropriate model, you can create the pipeline() by specifying the model repo:

>>> from optimum.pipelines import pipeline

# The model will be loaded to an ORTModelForQuestionAnswering.
>>> onnx_qa = pipeline("question-answering", model="deepset/roberta-base-squad2", accelerator="ort")
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)

It is also possible to load it with the from_pretrained(model_name_or_path, export=True) method associated with the ORTModelForXXX class. For example, here is how you can load the ORTModelForQuestionAnswering class for question answering:

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> from optimum.pipelines import pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

>>> # Loading the PyTorch checkpoint and converting to the ONNX format by providing
>>> # export=True
>>> model = ORTModelForQuestionAnswering.from_pretrained(
...     "deepset/roberta-base-squad2",
...     export=True
... )

>>> onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer, accelerator="ort")
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
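
If you plan to reuse the exported model, you can save the ONNX weights once and reload them later without exporting again; a minimal sketch (the local directory name below is arbitrary):

>>> # Save the exported ONNX model, then reload it later without export=True
>>> model.save_pretrained("roberta-base-squad2-onnx")
>>> model = ORTModelForQuestionAnswering.from_pretrained("roberta-base-squad2-onnx")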

Using Optimum models

The pipeline() function is tightly integrated with the BOINC AI Hub and can load ONNX models directly.

>>> from optimum.pipelines import pipeline

>>> onnx_qa = pipeline("question-answering", model="optimum/roberta-base-squad2", accelerator="ort")
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)

It is also possible to load it with the from_pretrained(model_name_or_path) method associated with the ORTModelForXXX class. For example, here is how you can load the ORTModelForQuestionAnswering class for question answering:

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> from optimum.pipelines import pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")

>>> # Loading directly an ONNX model from a model repo.
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

>>> onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer, accelerator="ort")
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
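
In each of the question-answering examples above, pred is a plain dict with score, start, end and answer keys; a quick sketch of inspecting it (the exact score will vary):

>>> # 'answer' holds the extracted span, 'score' the model's confidence
>>> print(pred["answer"])
Philipp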

Optimizing and quantizing in pipelines

The pipeline() function can not only run inference on vanilla ONNX Runtime checkpoints: you can also use checkpoints optimized with the ORTQuantizer and the ORTOptimizer.

Below you can find two examples of how you could use the ORTQuantizer and the ORTOptimizer to optimize or quantize your model and use it for inference afterwards.

Quantizing with the ORTQuantizer

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import (
...     AutoQuantizationConfig,
...     ORTModelForSequenceClassification,
...     ORTQuantizer
... )
>>> from optimum.pipelines import pipeline

>>> # Load the tokenizer and export the model to the ONNX format
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_dir = "distilbert_quantized"

>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)

>>> # Load the quantization configuration detailing the quantization we wish to apply
>>> qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
>>> quantizer = ORTQuantizer.from_pretrained(model)

>>> # Apply dynamic quantization and save the resulting model
>>> quantizer.quantize(save_dir=save_dir, quantization_config=qconfig)
>>> # Load the quantized model from a local repository
>>> model = ORTModelForSequenceClassification.from_pretrained(save_dir)

>>> # Create the transformers pipeline
>>> onnx_clx = pipeline("text-classification", model=model, tokenizer=tokenizer, accelerator="ort")
>>> text = "I like the new ORT pipeline"
>>> pred = onnx_clx(text)
>>> print(pred)
[{'label': 'POSITIVE', 'score': 0.9974810481071472}]

>>> # Save and push the model to the hub (in practice save_dir could be used here instead)
>>> model.save_pretrained("new_path_for_directory")
>>> model.push_to_hub("new_path_for_directory", repository_id="my-onnx-repo", use_auth_token=True)
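
AutoQuantizationConfig also provides presets for other instruction sets besides avx512_vnni (for example arm64 and avx2); a sketch, assuming the target CPU supports AVX2:

>>> from optimum.onnxruntime import AutoQuantizationConfig

>>> # Same dynamic, per-channel quantization as above, targeting AVX2 CPUs
>>> qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=True)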

Optimizing with the ORTOptimizer

>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import (
...     AutoOptimizationConfig,
...     ORTModelForSequenceClassification,
...     ORTOptimizer
... )
>>> from optimum.pipelines import pipeline

>>> # Load the tokenizer and export the model to the ONNX format
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_dir = "distilbert_optimized"

>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

>>> # Load the optimization configuration detailing the optimization we wish to apply
>>> optimization_config = AutoOptimizationConfig.O3()
>>> optimizer = ORTOptimizer.from_pretrained(model)

>>> optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)
>>> # Load the optimized model from a local repository
>>> model = ORTModelForSequenceClassification.from_pretrained(save_dir)

>>> # Create the transformers pipeline
>>> onnx_clx = pipeline("text-classification", model=model, tokenizer=tokenizer, accelerator="ort")
>>> text = "I like the new ORT pipeline"
>>> pred = onnx_clx(text)
>>> print(pred)
[{'label': 'POSITIVE', 'score': 0.9973127245903015}]

>>> # Save and push the model to the hub
>>> tokenizer.save_pretrained("new_path_for_directory")
>>> model.save_pretrained("new_path_for_directory")
>>> model.push_to_hub("new_path_for_directory", repository_id="my-onnx-repo", use_auth_token=True)
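
AutoOptimizationConfig exposes other levels besides O3: roughly, O1 applies basic general optimizations, O2 adds extended transformers-specific fusions, O3 adds GELU approximation, and O4 additionally enables mixed precision (fp16, GPU only). A sketch of choosing a more conservative level:

>>> from optimum.onnxruntime import AutoOptimizationConfig

>>> # O2: basic and extended optimizations, without GELU approximation
>>> optimization_config = AutoOptimizationConfig.O2()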

