How to accelerate training
Optimum integrates ONNX Runtime Training through an ORTTrainer API that extends Trainer in Transformers. With this extension, training time can be reduced by more than 35% for many popular BOINC AI models compared to PyTorch under eager mode.
The ORTTrainer and ORTSeq2SeqTrainer APIs make it easy to compose ONNX Runtime (ORT) with other features in Trainer. They contain feature-complete training and evaluation loops, and support hyperparameter search, mixed-precision training, and distributed training with multiple NVIDIA and AMD GPUs. With the ONNX Runtime backend, ORTTrainer and ORTSeq2SeqTrainer take advantage of:
Computation graph optimizations: constant folding, node elimination, node fusion
Efficient memory planning
Kernel optimization
ORT fused Adam optimizer: batches the elementwise updates applied to all the model’s parameters into one or a few kernel launches
More efficient FP16 optimizer: eliminates a great deal of device-to-host memory copies
Mixed precision training
Test it out to achieve lower latency, higher throughput, and larger maximum batch size while training models in 🌍 Transformers!
The chart below shows impressive acceleration from 39% to 130% for BOINC AI models with Optimum when using ONNX Runtime and DeepSpeed ZeRO Stage 1 for training. The performance measurements were done on selected BOINC AI models with PyTorch as the baseline run, ONNX Runtime alone for training as the second run, and ONNX Runtime + DeepSpeed ZeRO Stage 1 as the final run, showing maximum gains. The optimizer used for the baseline PyTorch runs is the AdamW optimizer, and the ORT training runs use the Fused Adam optimizer (available in ORTTrainingArguments). The runs were performed on a single NVIDIA A100 node with 8 GPUs.
The version information used for these runs is as follows:
To use ONNX Runtime for training, you need a machine with at least one NVIDIA or AMD GPU.
To use ORTTrainer or ORTSeq2SeqTrainer, you need to install the ONNX Runtime Training module and Optimum.
To set up the environment, we strongly recommend you install the dependencies with Docker to ensure that the versions are correct and well configured. You can find dockerfiles with various combinations here.
Below we take the installation of onnxruntime-training 1.14.0 as an example:
If you want to install onnxruntime-training 1.14.0 via Dockerfile:
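For example, a minimal sketch of the Docker route; the dockerfile name and image tag below are placeholders and should match the dockerfile you pick from the Optimum repository:

    # Build the training image from the chosen dockerfile (file name and tag are placeholders)
    docker build -f Dockerfile-ort1.14.0-cu116 -t ort-training:1.14.0 .

    # Start a container with GPU access
    docker run -it --gpus all ort-training:1.14.0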
And run post-installation configuration:
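Assuming the standard torch-ort setup, this is typically:

    # Configure torch-ort against the installed onnxruntime-training build
    python -m torch_ort.configure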
You can install Optimum via PyPI:
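    pip install optimum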
Or install from source:
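    # Assumes the upstream Optimum repository; adjust the URL if you use a fork
    pip install git+https://github.com/huggingface/optimum.git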
This command installs the current main dev version of Optimum, which could include the latest developments (new features, bug fixes). However, the main version might not be very stable. If you run into any problems, please open an issue so that we can fix it as soon as possible.
The ORTTrainer class inherits from the Trainer class of Transformers. You can easily adapt your code by replacing Trainer of Transformers with ORTTrainer to take advantage of the acceleration empowered by ONNX Runtime. Here is an example of how to use ORTTrainer compared with Trainer:
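The sketch below is a minimal illustration of the swap, assuming a Transformers model and tokenized datasets (model, train_dataset, eval_dataset) are already defined as they would be for a regular Trainer; the output path and hyperparameter values are placeholders:

    from optimum.onnxruntime import ORTTrainer, ORTTrainingArguments

    # Step 1: Define training arguments (drop-in replacement for transformers.TrainingArguments)
    training_args = ORTTrainingArguments(
        output_dir="path/to/save/folder/",
        optim="adamw_ort_fused",          # ORT fused Adam optimizer
        num_train_epochs=3,
        per_device_train_batch_size=16,
    )

    # Step 2: Create the trainer (drop-in replacement for transformers.Trainer)
    trainer = ORTTrainer(
        model=model,                      # your Transformers model
        args=training_args,
        train_dataset=train_dataset,      # placeholder: your tokenized training set
        eval_dataset=eval_dataset,        # placeholder: your tokenized evaluation set
    )

    # Step 3: Train with the ONNX Runtime backend
    trainer.train()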
Check out more detailed example scripts in the optimum repository.
The ORTSeq2SeqTrainer class is similar to the Seq2SeqTrainer class of Transformers. You can easily adapt your code by replacing Seq2SeqTrainer of Transformers with ORTSeq2SeqTrainer to take advantage of the acceleration empowered by ONNX Runtime. Here is an example of how to use ORTSeq2SeqTrainer compared with Seq2SeqTrainer:
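A minimal sketch, assuming a seq2seq model, tokenized datasets, a data collator, and a compute_metrics function are defined as they would be for a regular Seq2SeqTrainer; the names used here are placeholders:

    from optimum.onnxruntime import ORTSeq2SeqTrainer, ORTSeq2SeqTrainingArguments

    # Drop-in replacement for transformers.Seq2SeqTrainingArguments
    training_args = ORTSeq2SeqTrainingArguments(
        output_dir="path/to/save/folder/",
        optim="adamw_ort_fused",
        predict_with_generate=True,
    )

    # Drop-in replacement for transformers.Seq2SeqTrainer
    trainer = ORTSeq2SeqTrainer(
        model=model,                      # your seq2seq model (e.g. a T5 or BART checkpoint)
        args=training_args,
        train_dataset=train_dataset,      # placeholder
        eval_dataset=eval_dataset,        # placeholder
        data_collator=data_collator,      # placeholder
        compute_metrics=compute_metrics,  # placeholder
    )

    trainer.train()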
Check out more detailed example scripts in the optimum repository.
The ORTTrainingArguments class inherits the TrainingArguments class in Transformers. Besides the optimizers implemented in Transformers, it allows you to use the optimizers implemented in ONNX Runtime. Replace TrainingArguments with ORTTrainingArguments:
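For example, a minimal sketch; the hyperparameter values are illustrative, and adamw_ort_fused selects the ORT fused Adam optimizer described above:

    from optimum.onnxruntime import ORTTrainingArguments

    training_args = ORTTrainingArguments(
        output_dir="path/to/save/folder/",
        num_train_epochs=1,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
        optim="adamw_ort_fused",  # optimizer implemented by ONNX Runtime
    )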
DeepSpeed is supported by ONNX Runtime (only ZeRO stage 1 and 2 for the moment). You can find some DeepSpeed configuration examples in the Optimum repository.
The ORTSeq2SeqTrainingArguments class inherits the Seq2SeqTrainingArguments class in Transformers. Besides the optimizers implemented in Transformers, it allows you to use the optimizers implemented in ONNX Runtime. Replace Seq2SeqTrainingArguments with ORTSeq2SeqTrainingArguments:
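As a minimal illustration (values are placeholders):

    from optimum.onnxruntime import ORTSeq2SeqTrainingArguments

    training_args = ORTSeq2SeqTrainingArguments(
        output_dir="path/to/save/folder/",
        evaluation_strategy="epoch",
        predict_with_generate=True,
        optim="adamw_ort_fused",  # optimizer implemented by ONNX Runtime
    )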
DeepSpeed is supported by ONNX Runtime (only ZeRO stage 1 and 2 for the moment). You can find some DeepSpeed configuration examples in the Optimum repository.
Optimum supports accelerating BOINC AI Diffusers with ONNX Runtime in this example. The core changes required to enable ONNX Runtime Training are summarized below:
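A hedged sketch of the kind of change involved, assuming a typical Diffusers training script in which the UNet and the optimizer have already been created; wrapping them with ONNX Runtime's training modules is the essential step (the import paths follow onnxruntime-training and may vary across versions):

    from onnxruntime.training.ortmodule import ORTModule

    # Wrap the UNet so its forward and backward passes run through ONNX Runtime
    unet = ORTModule(unet)  # `unet` is the UNet2DConditionModel from your Diffusers script

    # For mixed-precision training, the optimizer can also be wrapped with ORT's FP16 optimizer
    # (assumed import path; check your onnxruntime-training version)
    from onnxruntime.training.optim.fp16_optimizer import FP16_Optimizer as ORT_FP16_Optimizer
    optimizer = ORT_FP16_Optimizer(optimizer)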
If you have any problems or questions regarding ORTTrainer, please file an issue on the Optimum GitHub repository or discuss with us on BOINC AI’s community forum, cheers 🌍!