Configuration
The configuration classes are the way to specify how a task should be performed. Two tasks are supported with the ONNX Runtime package:
Optimization: performed by the ORTOptimizer, this task can be tweaked using an OptimizationConfig.
Quantization: performed by the ORTQuantizer, this task can be configured using a QuantizationConfig. A calibration step is required in some cases (post-training static quantization), and can be specified using a CalibrationConfig.
OptimizationConfig
class optimum.onnxruntime.OptimizationConfig
( optimization_level: int = 1, optimize_for_gpu: bool = False, fp16: bool = False, optimize_with_onnxruntime_only: typing.Optional[bool] = None, enable_transformers_specific_optimizations: bool = True, disable_gelu: typing.Optional[bool] = None, disable_gelu_fusion: bool = False, disable_layer_norm: typing.Optional[bool] = None, disable_layer_norm_fusion: bool = False, disable_attention: typing.Optional[bool] = None, disable_attention_fusion: bool = False, disable_skip_layer_norm: typing.Optional[bool] = None, disable_skip_layer_norm_fusion: bool = False, disable_bias_skip_layer_norm: typing.Optional[bool] = None, disable_bias_skip_layer_norm_fusion: bool = False, disable_bias_gelu: typing.Optional[bool] = None, disable_bias_gelu_fusion: bool = False, disable_embed_layer_norm: bool = True, disable_embed_layer_norm_fusion: bool = True, enable_gelu_approximation: bool = False, use_mask_index: bool = False, no_attention_mask: bool = False, disable_shape_inference: bool = False, use_multi_head_attention: bool = False, enable_gemm_fast_gelu_fusion: bool = False, use_raw_attention_mask: bool = False, disable_group_norm_fusion: bool = True, disable_packed_kv: bool = True )
Parameters
optimization_level (int, defaults to 1) — Optimization level performed by ONNX Runtime on the loaded graph. Supported optimization levels are 0, 1, 2 and 99:
0: disables all optimizations
1: enables basic optimizations
2: enables basic and extended optimizations, including complex node fusions applied to the nodes assigned to the CPU or CUDA execution provider, making the resulting optimized graph hardware dependent
99: enables all available optimizations, including layout optimizations
optimize_for_gpu (bool, defaults to False) — Whether to optimize the model for GPU inference. The optimized graph might contain operators that are valid only on GPU or only on CPU when optimization_level > 1.
fp16 (bool, defaults to False) — Whether all weights and nodes should be converted from float32 to float16.
enable_transformers_specific_optimizations (bool, defaults to True) — Whether to apply transformers-specific optimizations on top of the ONNX Runtime general optimizations.
disable_gelu_fusion (bool, defaults to False) — Whether to disable Gelu fusion.
disable_layer_norm_fusion (bool, defaults to False) — Whether to disable LayerNormalization fusion.
disable_attention_fusion (bool, defaults to False) — Whether to disable Attention fusion.
disable_skip_layer_norm_fusion (bool, defaults to False) — Whether to disable SkipLayerNormalization fusion.
disable_bias_skip_layer_norm_fusion (bool, defaults to False) — Whether to disable Add Bias and SkipLayerNormalization fusion.
disable_bias_gelu_fusion (bool, defaults to False) — Whether to disable Add Bias and Gelu / FastGelu fusion.
disable_embed_layer_norm_fusion (bool, defaults to True) — Whether to disable EmbedLayerNormalization fusion. The default value is True since this fusion is incompatible with ONNX Runtime quantization.
enable_gelu_approximation (bool, defaults to False) — Whether to enable the Gelu / BiasGelu to FastGelu conversion. The default value is False since this approximation might slightly impact the model's accuracy.
use_mask_index (bool, defaults to False) — Whether to use a mask index instead of the raw attention mask in the attention operator.
no_attention_mask (bool, defaults to False) — Whether to not use attention masks. Only works for the bert model type.
disable_embed_layer_norm (bool, defaults to True) — Whether to disable EmbedLayerNormalization fusion. The default value is True since this fusion is incompatible with ONNX Runtime quantization.
disable_shape_inference (bool, defaults to False) — Whether to disable symbolic shape inference. The default value is False, but symbolic shape inference might cause issues in some cases.
use_multi_head_attention (bool, defaults to False) — Experimental argument. Whether to use the MultiHeadAttention operator instead of Attention. MultiHeadAttention has merged weights for the Q/K/V projections, which might be faster in some cases since three MatMuls are merged into one. Note that MultiHeadAttention might be slower than Attention when Q/K/V are not packed.
enable_gemm_fast_gelu_fusion (bool, defaults to False) — Whether to enable GemmFastGelu fusion.
use_raw_attention_mask (bool, defaults to False) — Whether to use the raw attention mask. Use this option if your input is not right-side padded. This might deactivate fused attention and degrade performance.
disable_group_norm_fusion (bool, defaults to True) — Whether to disable GroupNorm fusion. Only works for model_type=unet.
disable_packed_kv (bool, defaults to True) — Whether to disable packed KV in cross attention. Only works for model_type=unet.
OptimizationConfig is the configuration class handling all the ONNX Runtime optimization parameters. There are two stacks of optimizations:
The ONNX Runtime general-purpose optimization tool: it can work on any ONNX model.
The ONNX Runtime transformers optimization tool: it can only work on a subset of transformers models.
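As a quick illustration, here is a minimal sketch of how an OptimizationConfig is typically passed to an ORTOptimizer. The checkpoint name and save directory are placeholders, and depending on your optimum version the export flag may be export=True or from_transformers=True.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTOptimizer, OptimizationConfig

# Extended optimizations plus transformers-specific fusions, targeting CPU.
optimization_config = OptimizationConfig(optimization_level=2, optimize_for_gpu=False)

# Placeholder checkpoint; the PyTorch model is exported to ONNX on the fly.
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True
)

optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(save_dir="optimized_model", optimization_config=optimization_config)
```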
class optimum.onnxruntime.AutoOptimizationConfig
( )
Factory to create common OptimizationConfig instances.
O1
( for_gpu: bool = False, **kwargs ) → OptimizationConfig
Parameters
for_gpu (bool, defaults to False) — Whether the model to optimize will run on GPU; some optimizations depend on the hardware the model will run on. Only needed when optimization_level > 1.
kwargs (Dict[str, Any]) — Arguments to provide to the OptimizationConfig constructor.
Returns
OptimizationConfig — The OptimizationConfig corresponding to the O1 optimization level.
Creates an O1 OptimizationConfig.
O2
( for_gpu: bool = False, **kwargs ) → OptimizationConfig
Parameters
for_gpu (bool, defaults to False) — Whether the model to optimize will run on GPU; some optimizations depend on the hardware the model will run on. Only needed when optimization_level > 1.
kwargs (Dict[str, Any]) — Arguments to provide to the OptimizationConfig constructor.
Returns
OptimizationConfig — The OptimizationConfig corresponding to the O2 optimization level.
Creates an O2 OptimizationConfig.
O3
( for_gpu: bool = False, **kwargs ) → OptimizationConfig
Parameters
for_gpu (bool, defaults to False) — Whether the model to optimize will run on GPU; some optimizations depend on the hardware the model will run on. Only needed when optimization_level > 1.
kwargs (Dict[str, Any]) — Arguments to provide to the OptimizationConfig constructor.
Returns
OptimizationConfig — The OptimizationConfig corresponding to the O3 optimization level.
Creates an O3 OptimizationConfig.
O4
( for_gpu: bool = True, **kwargs ) → OptimizationConfig
Parameters
for_gpu (bool, defaults to True) — Whether the model to optimize will run on GPU; some optimizations depend on the hardware the model will run on. Only needed when optimization_level > 1.
kwargs (Dict[str, Any]) — Arguments to provide to the OptimizationConfig constructor.
Returns
OptimizationConfig — The OptimizationConfig corresponding to the O4 optimization level.
Creates an O4 OptimizationConfig.
with_optimization_level
( optimization_level: str, for_gpu: bool = False, **kwargs ) → OptimizationConfig
Parameters
optimization_level (str) — The optimization level; the following values are allowed:
O1: basic general optimizations
O2: basic and extended general optimizations, transformers-specific fusions
O3: same as O2 with fast Gelu approximation
O4: same as O3 with mixed precision (fp16)
for_gpu (bool, defaults to False) — Whether the model to optimize will run on GPU; some optimizations depend on the hardware the model will run on. Only needed when optimization_level > 1.
kwargs (Dict[str, Any]) — Arguments to provide to the OptimizationConfig constructor.
Returns
OptimizationConfig — The OptimizationConfig corresponding to the requested optimization level.
Creates an OptimizationConfig with pre-defined arguments according to an optimization level.
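For example, the factory methods are a shorthand for picking these arguments by hand. A sketch; the extra keyword argument is only there to show that kwargs are forwarded:

```python
from optimum.onnxruntime import AutoOptimizationConfig

# Equivalent ways to build an O2 configuration for GPU inference.
optimization_config = AutoOptimizationConfig.O2(for_gpu=True)
optimization_config = AutoOptimizationConfig.with_optimization_level("O2", for_gpu=True)

# Extra keyword arguments are forwarded to the OptimizationConfig constructor.
optimization_config = AutoOptimizationConfig.O2(for_gpu=True, disable_shape_inference=True)
```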
QuantizationConfig
class optimum.onnxruntime.QuantizationConfig
( is_static: bool, format: QuantFormat, mode: QuantizationMode = <QuantizationMode.QLinearOps: 1>, activations_dtype: QuantType = <QuantType.QUInt8: 1>, activations_symmetric: bool = False, weights_dtype: QuantType = <QuantType.QInt8: 0>, weights_symmetric: bool = True, per_channel: bool = False, reduce_range: bool = False, nodes_to_quantize: typing.List[str] = <factory>, nodes_to_exclude: typing.List[str] = <factory>, operators_to_quantize: typing.List[str] = <factory>, qdq_add_pair_to_weight: bool = False, qdq_dedicated_pair: bool = False, qdq_op_type_per_channel_support_to_axis: typing.Dict[str, int] = <factory> )
Parameters
is_static (bool) — Whether to apply static quantization or dynamic quantization.
format (QuantFormat) — Targeted ONNX Runtime quantization representation format. For the Operator-Oriented (QOperator) format, all the quantized operators have their own ONNX definitions. For the Tensor-Oriented (QDQ) format, the model is quantized by inserting QuantizeLinear / DeQuantizeLinear operators.
mode (QuantizationMode, defaults to QuantizationMode.QLinearOps) — Targeted ONNX Runtime quantization mode; the default is QLinearOps to match the QDQ format. When targeting dynamic quantization, the default value is QuantizationMode.IntegerOps, whereas the default value for static quantization is QuantizationMode.QLinearOps.
activations_dtype (QuantType, defaults to QuantType.QUInt8) — The quantization data type to use for the activations.
activations_symmetric (bool, defaults to False) — Whether to apply symmetric quantization on the activations.
weights_dtype (QuantType, defaults to QuantType.QInt8) — The quantization data type to use for the weights.
weights_symmetric (bool, defaults to True) — Whether to apply symmetric quantization on the weights.
per_channel (bool, defaults to False) — Whether we should quantize per-channel (also known as "per-row"). Enabling this can increase overall accuracy while making the quantized model heavier.
reduce_range (bool, defaults to False) — Whether to use reduce-range 7-bit integers instead of 8-bit integers.
nodes_to_quantize (List[str], defaults to []) — List of the names of the nodes to quantize. When empty, all nodes that are operators from operators_to_quantize will be quantized.
nodes_to_exclude (List[str], defaults to []) — List of the names of the nodes to exclude when applying quantization. The list of nodes in a model can be found by loading the ONNX model with onnx.load, or through visual inspection with netron.
operators_to_quantize (List[str]) — List of the operator types to quantize. Defaults to all quantizable operators for the given quantization mode and format. Quantizable operators can be found at https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/registry.py.
qdq_add_pair_to_weight (bool, defaults to False) — By default, floating-point weights are quantized and fed to a single inserted DeQuantizeLinear node. If set to True, the floating-point weights will remain and both QuantizeLinear / DeQuantizeLinear nodes will be inserted.
qdq_dedicated_pair (bool, defaults to False) — When inserting a QDQ pair, multiple nodes can share a single QDQ pair as their input. If True, an identical and dedicated QDQ pair will be created for each node.
qdq_op_type_per_channel_support_to_axis (Dict[str, int]) — Sets the channel axis for specific operator types. Effective only when per-channel quantization is supported and per_channel is set to True.
QuantizationConfig is the configuration class handling all the ONNX Runtime quantization parameters.
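A minimal sketch of building a dynamic-quantization configuration by hand and applying it with an ORTQuantizer. The checkpoint and save directory are placeholders, and depending on your optimum version the export flag may be export=True or from_transformers=True.

```python
from onnxruntime.quantization import QuantFormat, QuantizationMode, QuantType
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import QuantizationConfig

# Dynamic quantization: the QOperator format with the IntegerOps mode.
qconfig = QuantizationConfig(
    is_static=False,
    format=QuantFormat.QOperator,
    mode=QuantizationMode.IntegerOps,
    activations_dtype=QuantType.QUInt8,
    weights_dtype=QuantType.QInt8,
    per_channel=False,
)

# Placeholder checkpoint; the PyTorch model is exported to ONNX on the fly.
model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True
)

quantizer = ORTQuantizer.from_pretrained(model)
quantizer.quantize(save_dir="quantized_model", quantization_config=qconfig)
```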
AutoQuantizationConfig
class optimum.onnxruntime.AutoQuantizationConfig
( )
arm64
( is_static: bool, use_symmetric_activations: bool = False, use_symmetric_weights: bool = True, per_channel: bool = True, nodes_to_quantize: typing.Optional[typing.List[str]] = None, nodes_to_exclude: typing.Optional[typing.List[str]] = None, operators_to_quantize: typing.Optional[typing.List[str]] = None )
Parameters
is_static (bool) — Whether to target static or dynamic quantization.
use_symmetric_activations (bool, defaults to False) — Whether to use symmetric quantization for activations.
use_symmetric_weights (bool, defaults to True) — Whether to use symmetric quantization for weights.
per_channel (bool, defaults to True) — Whether we should quantize per-channel (also known as "per-row"). Enabling this can increase overall accuracy while making the quantized model heavier.
nodes_to_quantize (Optional[List[str]], defaults to None) — Specific nodes to quantize. If None, all nodes that are operators from operators_to_quantize will be quantized.
nodes_to_exclude (Optional[List[str]], defaults to None) — Specific nodes to exclude from quantization. The list of nodes in a model can be found by loading the ONNX model with onnx.load, or through visual inspection with netron.
operators_to_quantize (Optional[List[str]], defaults to None) — Types of nodes to perform quantization on. By default, all quantizable operators will be quantized. Quantizable operators can be found at https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/registry.py.
Creates a QuantizationConfig fit for ARM64.
avx2
( is_static: bool, use_symmetric_activations: bool = False, use_symmetric_weights: bool = True, per_channel: bool = True, reduce_range: bool = False, nodes_to_quantize: typing.Optional[typing.List[str]] = None, nodes_to_exclude: typing.Optional[typing.List[str]] = None, operators_to_quantize: typing.Optional[typing.List[str]] = None )
Parameters
is_static (bool) — Whether to target static or dynamic quantization.
use_symmetric_activations (bool, defaults to False) — Whether to use symmetric quantization for activations.
use_symmetric_weights (bool, defaults to True) — Whether to use symmetric quantization for weights.
per_channel (bool, defaults to True) — Whether we should quantize per-channel (also known as "per-row"). Enabling this can increase overall accuracy while making the quantized model heavier.
reduce_range (bool, defaults to False) — Whether to use 8-bit integers (False) or reduce-range 7-bit integers (True). As a baseline, it is always recommended to first test with the full range (reduce_range=False) and then, if the accuracy drop is significant, to try the reduced range (reduce_range=True). Intel CPUs using AVX512 (non-VNNI) can suffer from a saturation issue when invoking the VPMADDUBSW instruction. To counter this, one should use 7-bit rather than 8-bit integers.
nodes_to_quantize (Optional[List[str]], defaults to None) — Specific nodes to quantize. If None, all nodes that are operators from operators_to_quantize will be quantized.
nodes_to_exclude (Optional[List[str]], defaults to None) — Specific nodes to exclude from quantization. The list of nodes in a model can be found by loading the ONNX model with onnx.load, or through visual inspection with netron.
operators_to_quantize (Optional[List[str]], defaults to None) — Types of nodes to perform quantization on. By default, all quantizable operators will be quantized. Quantizable operators can be found at https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/registry.py.
Creates a QuantizationConfig fit for CPU with AVX2 instruction set.
avx512
( is_static: bool, use_symmetric_activations: bool = False, use_symmetric_weights: bool = True, per_channel: bool = True, reduce_range: bool = False, nodes_to_quantize: typing.Optional[typing.List[str]] = None, nodes_to_exclude: typing.Optional[typing.List[str]] = None, operators_to_quantize: typing.Optional[typing.List[str]] = None )
Parameters
is_static (bool) — Whether to target static or dynamic quantization.
use_symmetric_activations (bool, defaults to False) — Whether to use symmetric quantization for activations.
use_symmetric_weights (bool, defaults to True) — Whether to use symmetric quantization for weights.
per_channel (bool, defaults to True) — Whether we should quantize per-channel (also known as "per-row"). Enabling this can increase overall accuracy while making the quantized model heavier.
reduce_range (bool, defaults to False) — Whether to use 8-bit integers (False) or reduce-range 7-bit integers (True). As a baseline, it is always recommended to first test with the full range (reduce_range=False) and then, if the accuracy drop is significant, to try the reduced range (reduce_range=True). Intel CPUs using AVX512 (non-VNNI) can suffer from a saturation issue when invoking the VPMADDUBSW instruction. To counter this, one should use 7-bit rather than 8-bit integers.
nodes_to_quantize (Optional[List[str]], defaults to None) — Specific nodes to quantize. If None, all nodes that are operators from operators_to_quantize will be quantized.
nodes_to_exclude (Optional[List[str]], defaults to None) — Specific nodes to exclude from quantization. The list of nodes in a model can be found by loading the ONNX model with onnx.load, or through visual inspection with netron.
operators_to_quantize (Optional[List[str]], defaults to None) — Types of nodes to perform quantization on. By default, all quantizable operators will be quantized. Quantizable operators can be found at https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/registry.py.
Creates a QuantizationConfig fit for CPU with AVX512 instruction set.
avx512_vnni
( is_static: bool, use_symmetric_activations: bool = False, use_symmetric_weights: bool = True, per_channel: bool = True, nodes_to_quantize: typing.Optional[typing.List[str]] = None, nodes_to_exclude: typing.Optional[typing.List[str]] = None, operators_to_quantize: typing.Optional[typing.List[str]] = None )
Parameters
is_static (bool) — Whether to target static or dynamic quantization.
use_symmetric_activations (bool, defaults to False) — Whether to use symmetric quantization for activations.
use_symmetric_weights (bool, defaults to True) — Whether to use symmetric quantization for weights.
per_channel (bool, defaults to True) — Whether we should quantize per-channel (also known as "per-row"). Enabling this can increase overall accuracy while making the quantized model heavier.
nodes_to_quantize (Optional[List[str]], defaults to None) — Specific nodes to quantize. If None, all nodes that are operators from operators_to_quantize will be quantized.
nodes_to_exclude (Optional[List[str]], defaults to None) — Specific nodes to exclude from quantization. The list of nodes in a model can be found by loading the ONNX model with onnx.load, or through visual inspection with netron.
operators_to_quantize (Optional[List[str]], defaults to None) — Types of nodes to perform quantization on. By default, all quantizable operators will be quantized. Quantizable operators can be found at https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/registry.py.
Creates a QuantizationConfig fit for CPU with AVX512-VNNI instruction set.
When targeting Intel AVX512-VNNI CPUs, the underlying execution engine leverages the CPU instruction VPDPBUSD to compute i32 += i8(w) * u8(x) within a single instruction.
AVX512-VNNI (AVX512 Vector Neural Network Instruction) is an x86 extension instruction set and is part of the AVX-512 ISA.
AVX512-VNNI is designed to accelerate convolutional neural networks for INT8 inference.
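As an illustration, the factory methods above pick sensible defaults for a given instruction set, so only the quantization approach needs to be chosen. A sketch:

```python
from optimum.onnxruntime import AutoQuantizationConfig

# Dynamic quantization tuned for a VNNI-capable Intel CPU.
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# The equivalent call for an ARM64 target.
qconfig_arm64 = AutoQuantizationConfig.arm64(is_static=False, per_channel=False)
```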
tensorrt
( per_channel: bool = True, nodes_to_quantize: typing.Optional[typing.List[str]] = None, nodes_to_exclude: typing.Optional[typing.List[str]] = None, operators_to_quantize: typing.Optional[typing.List[str]] = None )
Parameters
per_channel (bool, defaults to True) — Whether we should quantize per-channel (also known as "per-row"). Enabling this can increase overall accuracy while making the quantized model heavier.
nodes_to_quantize (Optional[List[str]], defaults to None) — Specific nodes to quantize. If None, all nodes that are operators from operators_to_quantize will be quantized.
nodes_to_exclude (Optional[List[str]], defaults to None) — Specific nodes to exclude from quantization. The list of nodes in a model can be found by loading the ONNX model with onnx.load, or through visual inspection with netron.
operators_to_quantize (Optional[List[str]], defaults to None) — Types of nodes to perform quantization on. By default, all quantizable operators will be quantized. Quantizable operators can be found at https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/registry.py.
Creates a QuantizationConfig fit for TensorRT static quantization, targeting NVIDIA GPUs.
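A sketch of creating such a configuration; note that, being static, it must be combined with a calibration step as described in the next section:

```python
from optimum.onnxruntime import AutoQuantizationConfig

# Static quantization configuration aimed at NVIDIA GPUs through TensorRT.
qconfig = AutoQuantizationConfig.tensorrt(per_channel=True)
```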
CalibrationConfig
class optimum.onnxruntime.CalibrationConfig
( dataset_name: str, dataset_config_name: str, dataset_split: str, dataset_num_samples: int, method: CalibrationMethod, num_bins: typing.Optional[int] = None, num_quantized_bins: typing.Optional[int] = None, percentile: typing.Optional[float] = None, moving_average: typing.Optional[bool] = None, averaging_constant: typing.Optional[float] = None )
Parameters
dataset_name (str) — The name of the calibration dataset.
dataset_config_name (str) — The name of the calibration dataset configuration.
dataset_split (str) — Which split of the dataset is used to perform the calibration step.
dataset_num_samples (int) — The number of samples composing the calibration dataset.
method (CalibrationMethod) — The method chosen to calculate the activation quantization parameters using the calibration dataset.
num_bins (Optional[int], defaults to None) — The number of bins to use when creating the histogram during the calibration step with the Percentile or Entropy method.
num_quantized_bins (Optional[int], defaults to None) — The number of quantized bins to use during the calibration step with the Entropy method.
percentile (Optional[float], defaults to None) — The percentile to use when computing the activation quantization ranges during the calibration step with the Percentile method.
moving_average (Optional[bool], defaults to None) — Whether to compute the moving average of the minimum and maximum values during the calibration step with the MinMax method.
averaging_constant (Optional[float], defaults to None) — The constant smoothing factor to use when computing the moving average of the minimum and maximum values. Effective only when the MinMax calibration method is selected and moving_average is set to True.
CalibrationConfig is the configuration class handling all the ONNX Runtime parameters related to the calibration step of static quantization.
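As an illustration, a minimal sketch of a CalibrationConfig built directly from its constructor; the dataset name, split, sample count, and smoothing factor are placeholder values chosen for the example:

```python
from onnxruntime.quantization import CalibrationMethod
from optimum.onnxruntime.configuration import CalibrationConfig

# Calibrate activation ranges on 100 samples of an example dataset
# using the MinMax method with a moving average.
calibration_config = CalibrationConfig(
    dataset_name="glue",
    dataset_config_name="sst2",
    dataset_split="train",
    dataset_num_samples=100,
    method=CalibrationMethod.MinMax,
    moving_average=True,
    averaging_constant=0.01,
)
```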
ORTConfig
class optimum.onnxruntime.ORTConfig
( opset: typing.Optional[int] = None, use_external_data_format: bool = False, one_external_file: bool = True, optimization: typing.Optional[optimum.onnxruntime.configuration.OptimizationConfig] = None, quantization: typing.Optional[optimum.onnxruntime.configuration.QuantizationConfig] = None, **kwargs )
Parameters
opset (Optional[int], defaults to None) — The ONNX opset version to export the model with.
use_external_data_format (bool, defaults to False) — Allows exporting models larger than 2 GB.
one_external_file (bool, defaults to True) — When use_external_data_format=True, whether to save all tensors to one external file. If False, each tensor is saved to a file named after the tensor name. (Cannot be set to False for quantization.)
optimization (Optional[OptimizationConfig], defaults to None) — Specifies a configuration to optimize the ONNX Runtime model.
quantization (Optional[QuantizationConfig], defaults to None) — Specifies a configuration to quantize the ONNX Runtime model.
ORTConfig is the configuration class handling all the ONNX Runtime parameters related to the ONNX IR model export, optimization and quantization parameters.
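To close, a sketch of bundling everything into an ORTConfig, which can then be saved and shared alongside the model; the save path is illustrative, and this assumes the usual save_pretrained API inherited from its base configuration class:

```python
from optimum.onnxruntime import AutoOptimizationConfig, AutoQuantizationConfig, ORTConfig

optimization_config = AutoOptimizationConfig.O2()
quantization_config = AutoQuantizationConfig.avx512_vnni(is_static=False)

# Bundle export, optimization, and quantization parameters in a single object.
ort_config = ORTConfig(
    opset=13,
    optimization=optimization_config,
    quantization=quantization_config,
)
ort_config.save_pretrained("ort_config")
```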