Inference on CPU
This guide focuses on running inference with large models efficiently on CPU.
We have recently integrated BetterTransformer for faster inference on CPU for text, image and audio models. Check the documentation about this integration for more details.
TorchScript is a way to create serializable and optimizable models from PyTorch code. Any TorchScript program can be saved from a Python process and loaded in a process where there is no Python dependency. Compared to the default eager mode, jit mode in PyTorch normally yields better performance for model inference through optimization methodologies such as operator fusion.
For a gentle introduction to TorchScript, see the Introduction to TorchScript tutorial.
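To make the tracing step concrete, the sketch below traces a small standalone module with `torch.jit.trace`; the module and input shapes are hypothetical, chosen only for illustration:

```python
import torch

# A small hypothetical module standing in for a model's forward pass.
class TinyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x):
        # Linear + GELU is one of the operator patterns jit-based backends can fuse.
        return torch.nn.functional.gelu(self.linear(x))

model = TinyModel().eval()
example_input = torch.randn(1, 4)

with torch.no_grad():
    # Record the operations executed on the example input into a TorchScript graph.
    traced = torch.jit.trace(model, example_input)
    eager_out = model(example_input)
    jit_out = traced(example_input)

# The traced module is serializable and matches eager-mode outputs.
print(torch.allclose(eager_out, jit_out))
```

The traced module can then be saved with `traced.save(...)` and loaded without the original Python class definition.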
Intel® Extension for PyTorch provides further optimizations in jit mode for Transformers series models. It is highly recommended for users to take advantage of Intel® Extension for PyTorch with jit mode. Some frequently used operator patterns from Transformers models are already supported in Intel® Extension for PyTorch with jit mode fusions. Fusion patterns such as Multi-head-attention fusion, Concat Linear, Linear + Add, Linear + Gelu, and Add + LayerNorm fusion are enabled and perform well. The benefit of the fusion is delivered to users in a transparent fashion. According to the analysis, ~70% of the most popular NLP tasks in question-answering, text-classification, and token-classification can get performance benefits from these fusion patterns for both Float32 precision and BFloat16 Mixed precision.
Check more detailed information in the IPEX Graph Optimization documentation.
IPEX installation:
IPEX releases follow PyTorch releases; check the installation approaches in the IPEX installation guide.
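A minimal sketch of the installation via pip, assuming the default CPU wheel is appropriate; the release you pick should match your installed PyTorch version, per the IPEX installation guide:

```shell
# Install Intel® Extension for PyTorch (choose the release matching your PyTorch version).
pip install intel_extension_for_pytorch
```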
To enable JIT-mode in Trainer for evaluation or prediction, users should add `jit_mode_eval` in Trainer command arguments.
For PyTorch >= 1.14.0, JIT-mode could benefit any model for prediction and evaluation, since dict inputs are supported in jit.trace.
For PyTorch < 1.14.0, JIT-mode could benefit models whose forward parameter order matches the tuple input order in jit.trace, such as question-answering models. In the case where the forward parameter order does not match the tuple input order in jit.trace, such as text-classification models, jit.trace will fail; we capture this with an exception to make it fall back to eager mode, and logging is used to notify users.
Inference using jit mode on CPU:
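A sketch of such a run, assuming the `run_qa.py` script from the Transformers question-answering examples and a SQuAD-finetuned checkpoint (the model name and output path here are illustrative):

```shell
python run_qa.py \
  --model_name_or_path csarron/bert-base-uncased-squad-v1 \
  --dataset_name squad \
  --do_eval \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/ \
  --no_cuda \
  --jit_mode_eval
```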
Inference with IPEX using jit mode on CPU:
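A sketch of the same run with IPEX enabled, again assuming the `run_qa.py` example script and an illustrative checkpoint; `use_ipex` requires Intel® Extension for PyTorch to be installed:

```shell
python run_qa.py \
  --model_name_or_path csarron/bert-base-uncased-squad-v1 \
  --dataset_name squad \
  --do_eval \
  --max_seq_length 384 \
  --doc_stride 128 \
  --output_dir /tmp/ \
  --no_cuda \
  --use_ipex \
  --jit_mode_eval
```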
Take the Transformers question-answering task as an example of these use cases.