BetterTransformer

Overview

🌍 Optimum provides an API called BetterTransformer, a fast path of standard PyTorch Transformer APIs to benefit from interesting speedups on CPU & GPU through sparsity and fused kernels such as Flash Attention. For now, BetterTransformer supports the fastpath from the native nn.TransformerEncoderLayer as well as Flash Attention and Memory-Efficient Attention from torch.nn.functional.scaled_dot_product_attention.

Quickstart

Since its 1.13 version, PyTorch has shipped the stable version of a fast path for its standard Transformer APIs that provides out-of-the-box performance improvements for transformer-based models. You can benefit from interesting speedups on most consumer-type devices, including CPUs and both older and newer versions of NVIDIA GPUs. You can now use this feature in 🌍 Optimum together with Transformers and use it for major models in the BOINC AI ecosystem.

In its 2.0 version, PyTorch includes a native scaled dot-product attention operator (SDPA) as part of torch.nn.functional. This function encompasses several implementations that can be applied depending on the inputs and the hardware in use. See the official documentation for more information, and this blog post for benchmarks.
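
As an illustration, the native operator can be called directly; the following is a minimal sketch (not taken from the Optimum codebase) using arbitrary tensor shapes of (batch, num_heads, seq_len, head_dim):

import torch
import torch.nn.functional as F
# Arbitrary shapes for illustration: (batch, num_heads, seq_len, head_dim)
query = torch.randn(2, 8, 128, 64)
key = torch.randn(2, 8, 128, 64)
value = torch.randn(2, 8, 128, 64)
# PyTorch dispatches to the best available backend (Flash Attention,
# memory-efficient attention, or the math fallback) based on the inputs and hardware.
out = F.scaled_dot_product_attention(query, key, value, attn_mask=None, is_causal=True)
print(out.shape)  # torch.Size([2, 8, 128, 64])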

We provide an integration with these optimizations out of the box in 🌍 Optimum, so that you can convert any supported 🌍 Transformers model to use the optimized paths & the scaled_dot_product_attention function when relevant.

The PyTorch-native `scaled_dot_product_attention` operator can only dispatch to Flash Attention if no `attention_mask` is provided.

Thus, by default in training mode, the BetterTransformer integration drops the mask support and can only be used for training that does not require a padding mask for batched training. This is the case, for example, for masked language modeling or causal language modeling. BetterTransformer is not suited for fine-tuning models on tasks that require a padding mask.
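
For example, a training step for causal language modeling can run on a converted model without any attention mask, since every sequence in the batch has the same length. A minimal sketch, assuming the gpt2 checkpoint and random token ids for brevity:

import torch
from transformers import AutoModelForCausalLM
from optimum.bettertransformer import BetterTransformer
# Convert the model, then switch to training mode
model = BetterTransformer.transform(AutoModelForCausalLM.from_pretrained("gpt2"))
model.train()
# Fixed-length blocks of token ids, as in typical causal-LM pretraining; no attention_mask is passed
input_ids = torch.randint(0, model.config.vocab_size, (2, 64))
outputs = model(input_ids=input_ids, labels=input_ids)
outputs.loss.backward()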

In inference mode, the padding mask is kept for correctness and thus speedups should be expected only in the batch size = 1 case.
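
A minimal inference sketch, assuming the bert-base-cased checkpoint: with a single sequence there is no padding, so the fastpath kernels can be used:

import torch
from transformers import AutoModel, AutoTokenizer
from optimum.bettertransformer import BetterTransformer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = BetterTransformer.transform(AutoModel.from_pretrained("bert-base-cased"))
model.eval()
# Batch size 1: the attention mask is all ones, so the fastpath applies
inputs = tokenizer("BetterTransformer speeds up single-sequence inference.", return_tensors="pt")
with torch.inference_mode():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)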

Supported models

The list of supported models is below:

  • AlBERT
  • Bark
  • BART
  • BERT
  • BERT-generation
  • BLIP-2
  • BLOOM
  • CamemBERT
  • CLIP
  • CodeGen
  • Data2VecText
  • DistilBert
  • DeiT
  • Electra
  • Ernie
  • Falcon
  • FSMT
  • GPT2
  • GPT-j
  • GPT-neo
  • GPT-neo-x
  • GPT BigCode (SantaCoder, StarCoder)
  • HuBERT
  • LayoutLM
  • Llama & Llama2
  • MarkupLM
  • Marian
  • MBart
  • M2M100
  • OPT
  • ProphetNet
  • RemBERT
  • RoBERTa
  • RoCBert
  • RoFormer
  • Splinter
  • Tapas
  • ViLT
  • ViT
  • ViT-MAE
  • ViT-MSN
  • Wav2Vec2
  • Whisper
  • XLMRoberta
  • YOLOS

Quick usage

In order to use the BetterTransformer API, just run the following commands:


>>> from transformers import AutoModelForSequenceClassification
>>> from optimum.bettertransformer import BetterTransformer
>>> model_hf = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
>>> model = BetterTransformer.transform(model_hf, keep_original_model=True)

You can leave keep_original_model=False if you want to overwrite the current model with its BetterTransformer version.
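
If you need to go back to the canonical Transformers implementation, for example before saving the model, recent versions of 🌍 Optimum also expose a reverse transform. A minimal sketch (the output directory name is hypothetical):

>>> from transformers import AutoModelForSequenceClassification
>>> from optimum.bettertransformer import BetterTransformer
>>> model_hf = AutoModelForSequenceClassification.from_pretrained("bert-base-cased")
>>> model = BetterTransformer.transform(model_hf, keep_original_model=True)
>>> # Revert to the canonical implementation before saving (hypothetical output path)
>>> model = BetterTransformer.reverse(model)
>>> model.save_pretrained("bert-base-cased-reversed")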

Let us know by opening an issue in 🌍 Optimum if you want more models to be supported, or check out the contribution guideline if you want to add support for a model yourself!

Head over to the tutorials section to understand in depth how to use it, or check out the Google colab demo!
