Train with a script

Along with the 🌎 Transformers notebooks, there are also example scripts demonstrating how to train a model for a task with PyTorch, TensorFlow, or JAX/Flax.

You will also find scripts we’ve used in our research projects and legacy examples which are mostly community contributed. These scripts are not actively maintained and require a specific version of 🌎 Transformers that will most likely be incompatible with the latest version of the library.

The example scripts are not expected to work out-of-the-box on every problem, and you may need to adapt the script to the problem you’re trying to solve. To help you with this, most of the scripts fully expose how data is preprocessed, allowing you to edit it as necessary for your use case.

For any feature you’d like to implement in an example script, please discuss it on the forum or in an issue before submitting a Pull Request. While we welcome bug fixes, it is unlikely we will merge a Pull Request that adds more functionality at the cost of readability.

This guide will show you how to run an example summarization training script in PyTorch and TensorFlow. All examples are expected to work with both frameworks unless otherwise specified.

Setup

To successfully run the latest version of the example scripts, you have to install 🌎Transformers from source in a new virtual environment:

Copied

git clone https://github.com/boincai/transformers
cd transformers
pip install .

For older versions of the example scripts, click on the toggle below:

Examples for older versions of 🌎 Transformers

Then switch your current clone of 🌎 Transformers to a specific version, like v3.5.1 for example:

Copied

git checkout tags/v3.5.1

After you’ve setup the correct library version, navigate to the example folder of your choice and install the example specific requirements:

Copied

pip install -r requirements.txt

Run a script

PytorchHide Pytorch content

The example script downloads and preprocesses a dataset from the 🌎 Datasets library. Then the script fine-tunes a dataset with the Trainer on an architecture that supports summarization. The following example shows how to fine-tune T5-small on the CNN/DailyMail dataset. The T5 model requires an additional source_prefix argument due to how it was trained. This prompt lets T5 know this is a summarization task.

Copied

python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate

TensorFlowHide TensorFlow content

The example script downloads and preprocesses a dataset from the 🌎 Datasets library. Then the script fine-tunes a dataset using Keras on an architecture that supports summarization. The following example shows how to fine-tune T5-small on the CNN/DailyMail dataset. The T5 model requires an additional source_prefix argument due to how it was trained. This prompt lets T5 know this is a summarization task.

Copied

python examples/tensorflow/summarization/run_summarization.py  \
    --model_name_or_path t5-small \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --output_dir /tmp/tst-summarization  \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 16 \
    --num_train_epochs 3 \
    --do_train \
    --do_eval

Distributed training and mixed precision

The Trainer supports distributed training and mixed precision, which means you can also use it in a script. To enable both of these features:

Add the fp16 argument to enable mixed precision.
Set the number of GPUs to use with the nproc_per_node argument.

Copied

python -m torch.distributed.launch \
    --nproc_per_node 8 pytorch/summarization/run_summarization.py \
    --fp16 \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate

TensorFlow scripts utilize a MirroredStrategy for distributed training, and you don’t need to add any additional arguments to the training script. The TensorFlow script will use multiple GPUs by default if they are available.

Run a script on a TPU

PytorchHide Pytorch content

Tensor Processing Units (TPUs) are specifically designed to accelerate performance. PyTorch supports TPUs with the XLA deep learning compiler (see here for more details). To use a TPU, launch the xla_spawn.py script and use the num_cores argument to set the number of TPU cores you want to use.

Copied

python xla_spawn.py --num_cores 8 \
    summarization/run_summarization.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate

TensorFlowHide TensorFlow content

Tensor Processing Units (TPUs) are specifically designed to accelerate performance. TensorFlow scripts utilize a TPUStrategy for training on TPUs. To use a TPU, pass the name of the TPU resource to the tpu argument.

Copied

python run_summarization.py  \
    --tpu name_of_tpu_resource \
    --model_name_or_path t5-small \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --output_dir /tmp/tst-summarization  \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 16 \
    --num_train_epochs 3 \
    --do_train \
    --do_eval

Run a script with 🌎 Accelerate

🌎 Accelerate is a PyTorch-only library that offers a unified method for training a model on several types of setups (CPU-only, multiple GPUs, TPUs) while maintaining complete visibility into the PyTorch training loop. Make sure you have 🌎 Accelerate installed if you don’t already have it:

Note: As Accelerate is rapidly developing, the git version of accelerate must be installed to run the scripts
Copied
pip install git+https://github.com/boincai/accelerate

Instead of the run_summarization.py script, you need to use the run_summarization_no_trainer.py script. 🌎 Accelerate supported scripts will have a task_no_trainer.py file in the folder. Begin by running the following command to create and save a configuration file:

Copied

accelerate config

Test your setup to make sure it is configured correctly:

Copied

accelerate test

Now you are ready to launch the training:

Copied

accelerate launch run_summarization_no_trainer.py \
    --model_name_or_path t5-small \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir ~/tmp/tst-summarization

Use a custom dataset

The summarization script supports custom datasets as long as they are a CSV or JSON Line file. When you use your own dataset, you need to specify several additional arguments:

train_file and validation_file specify the path to your training and validation files.
text_column is the input text to summarize.
summary_column is the target text to output.

A summarization script using a custom dataset would look like this:

Copied

python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --train_file path_to_csv_or_jsonlines_file \
    --validation_file path_to_csv_or_jsonlines_file \
    --text_column text_column_name \
    --summary_column summary_column_name \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --overwrite_output_dir \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --predict_with_generate

Test a script

It is often a good idea to run your script on a smaller number of dataset examples to ensure everything works as expected before committing to an entire dataset which may take hours to complete. Use the following arguments to truncate the dataset to a maximum number of samples:

max_train_samples
max_eval_samples
max_predict_samples

Copied

python examples/pytorch/summarization/run_summarization.py \
    --model_name_or_path t5-small \
    --max_train_samples 50 \
    --max_eval_samples 50 \
    --max_predict_samples 50 \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate

Not all example scripts support the max_predict_samples argument. If you aren’t sure whether your script supports this argument, add the -h argument to check:

Copied

examples/pytorch/summarization/run_summarization.py -h

Resume training from checkpoint

Another helpful option to enable is resuming training from a previous checkpoint. This will ensure you can pick up where you left off without starting over if your training gets interrupted. There are two methods to resume training from a checkpoint.

The first method uses the output_dir previous_output_dir argument to resume training from the latest checkpoint stored in output_dir. In this case, you should remove overwrite_output_dir:

Copied

python examples/pytorch/summarization/run_summarization.py
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --output_dir previous_output_dir \
    --predict_with_generate

The second method uses the resume_from_checkpoint path_to_specific_checkpoint argument to resume training from a specific checkpoint folder.

Copied

python examples/pytorch/summarization/run_summarization.py
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --resume_from_checkpoint path_to_specific_checkpoint \
    --predict_with_generate

All scripts can upload your final model to the Model Hub. Make sure you are logged into BOINC AI before you begin:

Copied

boincai-cli login

Then add the push_to_hub argument to the script. This argument will create a repository with your BOINC AI username and the folder name specified in output_dir.

To give your repository a specific name, use the push_to_hub_model_id argument to add it. The repository will be automatically listed under your namespace.

The following example shows how to upload a model with a specific repository name:

Copied

python examples/pytorch/summarization/run_summarization.py
    --model_name_or_path t5-small \
    --do_train \
    --do_eval \
    --dataset_name cnn_dailymail \
    --dataset_config "3.0.0" \
    --source_prefix "summarize: " \
    --push_to_hub \
    --push_to_hub_model_id finetuned-t5-cnn_dailymail \
    --output_dir /tmp/tst-summarization \
    --per_device_train_batch_size=4 \
    --per_device_eval_batch_size=4 \
    --overwrite_output_dir \
    --predict_with_generate

PreviousFine-tune a pretrained model NextSet up distributed training with BOINC AI Accelerate

Last updated 1 year ago