# Inference pipelines

## Inference pipelines with the ONNX Runtime accelerator

The `pipeline()` function makes it simple to use models from the [Model Hub](https://huggingface.co/models) for accelerated inference on a variety of tasks such as text classification, question answering and image classification.

You can also use the [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#pipelines) function from Transformers and provide your Optimum model class.

Currently the supported tasks are:

* `feature-extraction`
* `text-classification`
* `token-classification`
* `question-answering`
* `zero-shot-classification`
* `text-generation`
* `text2text-generation`
* `summarization`
* `translation`
* `image-classification`
* `automatic-speech-recognition`
* `image-to-text`

### Optimum pipeline usage

While each task has an associated pipeline class, it is simpler to use the general `pipeline()` function which wraps all the task-specific pipelines in one object. The `pipeline()` function automatically loads a default model and tokenizer/feature-extractor capable of performing inference for your task.

1. Start by creating a pipeline by specifying an inference task:

Copied

```
>>> from optimum.pipelines import pipeline

>>> classifier = pipeline(task="text-classification", accelerator="ort")
```

2. Pass your input text/image to the `pipeline()` function:

Copied

```
>>> classifier("I like you. I love you.")
[{'label': 'POSITIVE', 'score': 0.9998838901519775}]
```

*Note: The default models used in the `pipeline()` function are not optimized for inference or quantized, so there won’t be a performance improvement compared to their PyTorch counterparts.*

#### Using vanilla Transformers model and converting to ONNX

The `pipeline()` function accepts any supported model from the [BOINC AI Hub](https://huggingface.co/models). There are tags on the Model Hub that allow you to filter for a model you’d like to use for your task.

To be able to load the model with the ONNX Runtime backend, the export to ONNX needs to be supported for the considered architecture.

You can check the list of supported architectures [here](https://huggingface.co/docs/optimum/exporters/onnx/overview#overview).

Once you have picked an appropriate model, you can create the `pipeline()` by specifying the model repo:

Copied

```
>>> from optimum.pipelines import pipeline

# The model will be loaded to an ORTModelForQuestionAnswering.
>>> onnx_qa = pipeline("question-answering", model="deepset/roberta-base-squad2", accelerator="ort")
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```

It is also possible to load it with the `from_pretrained(model_name_or_path, export=True)` method associated with the `ORTModelForXXX` class.

For example, here is how you can load the [ORTModelForQuestionAnswering](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort#optimum.onnxruntime.ORTModelForQuestionAnswering) class for question answering:

Copied

```
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> from optimum.pipelines import pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("deepset/roberta-base-squad2")

>>> # Loading the PyTorch checkpoint and converting to the ONNX format by providing
>>> # export=True
>>> model = ORTModelForQuestionAnswering.from_pretrained(
...     "deepset/roberta-base-squad2",
...     export=True
... )

>>> onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer, accelerator="ort")
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```

#### Using Optimum models

The `pipeline()` function is tightly integrated with the [BOINC AI Hub](https://huggingface.co/models) and can load ONNX models directly.

Copied

```
>>> from optimum.pipelines import pipeline

>>> onnx_qa = pipeline("question-answering", model="optimum/roberta-base-squad2", accelerator="ort")
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```

It is also possible to load it with the `from_pretrained(model_name_or_path)` method associated with the `ORTModelForXXX` class.

For example, here is how you can load the [ORTModelForQuestionAnswering](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/modeling_ort#optimum.onnxruntime.ORTModelForQuestionAnswering) class for question answering:

Copied

```
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import ORTModelForQuestionAnswering
>>> from optimum.pipelines import pipeline

>>> tokenizer = AutoTokenizer.from_pretrained("optimum/roberta-base-squad2")

>>> # Loading directly an ONNX model from a model repo.
>>> model = ORTModelForQuestionAnswering.from_pretrained("optimum/roberta-base-squad2")

>>> onnx_qa = pipeline("question-answering", model=model, tokenizer=tokenizer, accelerator="ort")
>>> question = "What's my name?"
>>> context = "My name is Philipp and I live in Nuremberg."

>>> pred = onnx_qa(question=question, context=context)
```

### Optimizing and quantizing in pipelines

The `pipeline()` function can not only run inference on vanilla ONNX Runtime checkpoints - you can also use checkpoints optimized with the [ORTQuantizer](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/quantization#optimum.onnxruntime.ORTQuantizer) and the [ORTOptimizer](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/optimization#optimum.onnxruntime.ORTOptimizer).

Below you can find two examples of how you could use the [ORTOptimizer](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/optimization#optimum.onnxruntime.ORTOptimizer) and the [ORTQuantizer](https://huggingface.co/docs/optimum/main/en/onnxruntime/package_reference/quantization#optimum.onnxruntime.ORTQuantizer) to optimize/quantize your model and use it for inference afterwards.

#### Quantizing with the ORTQuantizer

Copied

```
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import (
...     AutoQuantizationConfig,
...     ORTModelForSequenceClassification,
...     ORTQuantizer
... )
>>> from optimum.pipelines import pipeline

>>> # Load the tokenizer and export the model to the ONNX format
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_dir = "distilbert_quantized"

>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

>>> # Load the quantization configuration detailing the quantization we wish to apply
>>> qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
>>> quantizer = ORTQuantizer.from_pretrained(model)

>>> # Apply dynamic quantization and save the resulting model
>>> quantizer.quantize(save_dir=save_dir, quantization_config=qconfig)
>>> # Load the quantized model from a local repository
>>> model = ORTModelForSequenceClassification.from_pretrained(save_dir)

>>> # Create the transformers pipeline
>>> onnx_clx = pipeline("text-classification", model=model, accelerator="ort")
>>> text = "I like the new ORT pipeline"
>>> pred = onnx_clx(text)
>>> print(pred)
>>> # [{'label': 'POSITIVE', 'score': 0.9974810481071472}]

>>> # Save and push the model to the hub (in practice save_dir could be used here instead)
>>> model.save_pretrained("new_path_for_directory")
>>> model.push_to_hub("new_path_for_directory", repository_id="my-onnx-repo", use_auth_token=True)
```

#### Optimizing with ORTOptimizer

Copied

```
>>> from transformers import AutoTokenizer
>>> from optimum.onnxruntime import (
...     AutoOptimizationConfig,
...     ORTModelForSequenceClassification,
...     ORTOptimizer
... )
>>> from optimum.onnxruntime.configuration import OptimizationConfig
>>> from optimum.pipelines import pipeline

>>> # Load the tokenizer and export the model to the ONNX format
>>> model_id = "distilbert-base-uncased-finetuned-sst-2-english"
>>> save_dir = "distilbert_optimized"

>>> tokenizer = AutoTokenizer.from_pretrained(model_id)
>>> model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

>>> # Load the optimization configuration detailing the optimization we wish to apply
>>> optimization_config = AutoOptimizationConfig.O3()
>>> optimizer = ORTOptimizer.from_pretrained(model)

>>> optimizer.optimize(save_dir=save_dir, optimization_config=optimization_config)
# Load the optimized model from a local repository
>>> model = ORTModelForSequenceClassification.from_pretrained(save_dir)

# Create the transformers pipeline
>>> onnx_clx = pipeline("text-classification", model=model, accelerator="ort")
>>> text = "I like the new ORT pipeline"
>>> pred = onnx_clx(text)
>>> print(pred)
>>> # [{'label': 'POSITIVE', 'score': 0.9973127245903015}]

# Save and push the model to the hub
>>> tokenizer.save_pretrained("new_path_for_directory")
>>> model.save_pretrained("new_path_for_directory")
>>> model.push_to_hub("new_path_for_directory", repository_id="my-onnx-repo", use_auth_token=True)
```


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://boinc-ai.gitbook.io/optimum/onnx-runtime/how-to-guides/inference-pipelines.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
