BOINC AI Benchmarking tools are deprecated and it is advised to use external Benchmarking libraries to measure the speed and memory complexity of Transformer models.
Let’s take a look at how 🌍 Transformers models can be benchmarked, best practices, and already available benchmarks.
A notebook explaining in more detail how to benchmark 🌍 Transformers models can be found here.
How to benchmark 🌍 Transformers models
The classes PyTorchBenchmark and TensorFlowBenchmark allow to flexibly benchmark 🌍 Transformers models. The benchmark classes allow us to measure the peak memory usage and required time for both inference and training.
Hereby, inference is defined by a single forward pass, and training is defined by a single forward pass and backward pass.
The benchmark classes PyTorchBenchmark and TensorFlowBenchmark expect an object of type PyTorchBenchmarkArguments and TensorFlowBenchmarkArguments, respectively, for instantiation. PyTorchBenchmarkArguments and TensorFlowBenchmarkArguments are data classes and contain all relevant configurations for their corresponding benchmark class. In the following example, it is shown how a BERT model of type bert-base-cased can be benchmarked.
Here, three arguments are given to the benchmark argument data classes, namely models, batch_sizes, and sequence_lengths. The argument models is required and expects a list of model identifiers from the model hub The list arguments batch_sizes and sequence_lengths define the size of the input_ids on which the model is benchmarked. There are many more parameters that can be configured via the benchmark argument data classes. For more detail on these one can either directly consult the files src/transformers/benchmark/benchmark_args_utils.py, src/transformers/benchmark/benchmark_args.py (for PyTorch) and src/transformers/benchmark/benchmark_args_tf.py (for Tensorflow). Alternatively, running the following shell commands from root will print out a descriptive list of all configurable parameters for PyTorch and Tensorflow respectively.
By default, the time and the required memory for inference are benchmarked. In the example output above the first two sections show the result corresponding to inference time and inference memory. In addition, all relevant information about the computing environment, e.g. the GPU type, the system, the library versions, etc… are printed out in the third section under ENVIRONMENT INFORMATION. This information can optionally be saved in a .csv file when adding the argument save_to_csv=True to PyTorchBenchmarkArguments and TensorFlowBenchmarkArguments respectively. In this case, every section is saved in a separate .csv file. The path to each .csv file can optionally be defined via the argument data classes.
Instead of benchmarking pre-trained models via their model identifier, e.g.bert-base-uncased, the user can alternatively benchmark an arbitrary configuration of any available model class. In this case, a list of configurations must be inserted with the benchmark args as follows.
Again, inference time and required memory for inference are measured, but this time for customized configurations of the BertModel class. This feature can especially be helpful when deciding for which configuration the model should be trained.
Benchmark best practices
This section lists a couple of best practices one should be aware of when benchmarking a model.
Currently, only single device benchmarking is supported. When benchmarking on GPU, it is recommended that the user specifies on which device the code should be run by setting the CUDA_VISIBLE_DEVICES environment variable in the shell, e.g.export CUDA_VISIBLE_DEVICES=0 before running the code.
The option no_multi_processing should only be set to True for testing and debugging. To ensure accurate memory measurement it is recommended to run each memory benchmark in a separate process by making sure no_multi_processing is set to True.
One should always state the environment information when sharing the results of a model benchmark. Results can vary heavily between different GPU devices, library versions, etc., so that benchmark results on their own are not very useful for the community.
Sharing your benchmark
Previously all available core models (10 at the time) have been benchmarked for inference time, across many different settings: using PyTorch, with and without TorchScript, using TensorFlow, with and without XLA. All of those tests were done across CPUs (except for TensorFlow XLA) and GPUs.