Inference pipelines with the ONNX Runtime accelerator

The pipeline() function makes it simple to use models from the Model Hub for accelerated inference on a variety of tasks such as text classification, question answering and image classification.

You can also use the pipeline() function from Transformers and provide your Optimum model class.

Currently the supported tasks are:

  • feature-extraction

  • text-classification

  • token-classification

  • question-answering

  • zero-shot-classification

  • text-generation

  • text2text-generation

  • summarization

  • translation

  • image-classification

  • automatic-speech-recognition

  • image-to-text

Optimum pipeline usage

While each task has an associated pipeline class, it is simpler to use the general pipeline() function which wraps all the task-specific pipelines in one object. The pipeline() function automatically loads a default model and tokenizer/feature-extractor capable of performing inference for your task.

  1. Start by creating a pipeline by specifying an inference task:

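For instance, you could create a text-classification pipeline like this (a minimal sketch assuming the optimum.pipelines API; the task is just an example):

```python
from optimum.pipelines import pipeline

# Create a pipeline for an example task, backed by the ONNX Runtime accelerator
classifier = pipeline(task="text-classification", accelerator="ort")
```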

  2. Pass your input text/image to the pipeline() function:

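Continuing the sketch above, calling the pipeline on a sample input returns the usual task-specific output (the input string is illustrative):

```python
# Run inference on an example input; the output format matches the Transformers pipeline
result = classifier("I love using ONNX Runtime for accelerated inference!")
print(result)  # e.g. a list of {"label": ..., "score": ...} dictionaries
```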

Note: The default models used in the pipeline() function are not optimized for inference or quantized, so there won’t be a performance improvement compared to their PyTorch counterparts.

Using a vanilla Transformers model and converting it to ONNX

The pipeline() function accepts any supported model from the BOINC AI Hub. There are tags on the Model Hub that allow you to filter for a model you’d like to use for your task.

To be loaded with the ONNX Runtime backend, the model’s architecture needs to be supported for export to ONNX.

You can check the list of supported architectures here.

Once you have picked an appropriate model, you can create the pipeline() by specifying the model repo:

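For example, the following sketch builds a question-answering pipeline from a vanilla PyTorch checkpoint and lets the ONNX Runtime accelerator export it on the fly (the model repo and inputs are only examples):

```python
from optimum.pipelines import pipeline

# The PyTorch checkpoint is exported to ONNX on the fly before inference
onnx_qa = pipeline(
    task="question-answering",
    model="deepset/roberta-base-squad2",  # example model repo
    accelerator="ort",
)

pred = onnx_qa(
    question="What's my name?",
    context="My name is Philipp and I live in Nuremberg.",
)
```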

It is also possible to load it with the from_pretrained(model_name_or_path, export=True) method associated with the ORTModelForXXX class.

For example, here is how you can load the ORTModelForQuestionAnswering class for question answering:

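A sketch of that approach (the checkpoint name and inputs are examples):

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

model_id = "deepset/roberta-base-squad2"  # example PyTorch checkpoint
# export=True converts the PyTorch weights to ONNX when loading
model = ORTModelForQuestionAnswering.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
pred = onnx_qa(
    question="What's my name?",
    context="My name is Philipp and I live in Nuremberg.",
)
```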

Using Optimum models

The pipeline() function is tightly integrated with the BOINC AI Hub and can load ONNX models directly.

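For example (a sketch; the repo shown is assumed to already host an ONNX export):

```python
from optimum.pipelines import pipeline

# Load an ONNX checkpoint straight from the Hub, no conversion step needed
onnx_qa = pipeline(
    task="question-answering",
    model="optimum/roberta-base-squad2",  # example repo containing ONNX weights
    accelerator="ort",
)
```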

It is also possible to load it with the from_pretrained(model_name_or_path) method associated with the ORTModelForXXX class.

For example, here is how you can load the ORTModelForQuestionAnswering class for question answering:

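A sketch of the equivalent flow with the model class (same example repo as above):

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForQuestionAnswering

model_id = "optimum/roberta-base-squad2"  # example repo containing ONNX weights
model = ORTModelForQuestionAnswering.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer)
pred = onnx_qa(
    question="What's my name?",
    context="My name is Philipp and I live in Nuremberg.",
)
```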

Optimizing and quantizing in pipelines

The pipeline() function can not only run inference on vanilla ONNX Runtime checkpoints; you can also use checkpoints that have been quantized with the ORTQuantizer or optimized with the ORTOptimizer.

Below you can find two examples of how you could use the ORTOptimizer and the ORTQuantizer to optimize/quantize your model and use it for inference afterwards.

Quantizing with the ORTQuantizer

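A sketch of dynamic quantization followed by inference (the checkpoint, quantization configuration, and output directory are examples):

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
save_dir = "distilbert_quantized"  # example output directory

# Export the PyTorch checkpoint to ONNX, then apply dynamic quantization
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
quantizer = ORTQuantizer.from_pretrained(model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir=save_dir, quantization_config=qconfig)

# Load the quantized model and run it through a pipeline
model = ORTModelForSequenceClassification.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(model_id)
onnx_clx = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = onnx_clx("Quantized ONNX Runtime models can speed up inference.")
```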

Optimizing with the ORTOptimizer

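A sketch of graph optimization followed by inference (again, the checkpoint, optimization level, and output directory are examples):

```python
from transformers import AutoTokenizer, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
save_dir = "distilbert_optimized"  # example output directory

# Export the PyTorch checkpoint to ONNX, then apply graph optimizations
model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
optimizer = ORTOptimizer.from_pretrained(model)
optimization_config = OptimizationConfig(optimization_level=2)
optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)

# Load the optimized model and run it through a pipeline
model = ORTModelForSequenceClassification.from_pretrained(save_dir)
tokenizer = AutoTokenizer.from_pretrained(model_id)
onnx_clx = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = onnx_clx("Graph-optimized ONNX Runtime models can reduce latency.")
```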
