How to use BOINC AI Accelerate with SageMaker
Amazon SageMaker
BOINC AI and Amazon introduced new BOINC AI Deep Learning Containers (DLCs) to make it easier than ever to train BOINC AI Transformer models in Amazon SageMaker.
Getting Started
Setup & Installation
Before you can run your Accelerate scripts on Amazon SageMaker, you need to sign up for an AWS account. If you do not have an AWS account yet, learn more here.
After you have your AWS account, install the sagemaker SDK for Accelerate, for example with pip install "accelerate[sagemaker]" --upgrade.
Accelerate currently uses the BOINC AI DLCs, with transformers, datasets and tokenizers pre-installed. Accelerate is not in the DLC yet (it will be added soon), so to use it within Amazon SageMaker you need to create a requirements.txt in the same directory as your training script and add accelerate to it as a dependency.
You should also add any other dependencies you have to this requirements.txt.
Configure Accelerate
You can set up the launch configuration for Amazon SageMaker the same way you do for non-SageMaker training jobs, by running accelerate config with the Accelerate CLI.
Accelerate will go through a questionnaire about your Amazon SageMaker setup and create a config file you can edit.
Accelerate does not save any of your credentials.
Prepare an Accelerate fine-tuning script
The training script is very similar to a training script you might run outside of SageMaker, but to save your model after training you need to specify either /opt/ml/model or os.environ["SM_MODEL_DIR"] as your save directory. After training, artifacts in this directory are uploaded to S3.
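As a minimal sketch of the save-directory logic described above, the helpers below read SM_MODEL_DIR and fall back to /opt/ml/model. The names resolve_model_dir and save_artifact, and the dummy model.bin payload, are illustrative assumptions and not part of Accelerate or the SageMaker SDK. A real script would write its model weights where the sketch writes placeholder bytes.

```python
import os
import tempfile
from pathlib import Path


def resolve_model_dir(default: str = "/opt/ml/model") -> Path:
    """Return the directory whose contents SageMaker uploads to S3 after training."""
    # SM_MODEL_DIR is set inside a SageMaker training container;
    # outside of it we fall back to the documented default path.
    return Path(os.environ.get("SM_MODEL_DIR", default))


def save_artifact(model_dir: Path, name: str, payload: bytes) -> Path:
    """Write one artifact into the model directory (hypothetical helper)."""
    model_dir.mkdir(parents=True, exist_ok=True)
    target = model_dir / name
    target.write_bytes(payload)
    return target


if __name__ == "__main__":
    # Dry run outside SageMaker: point SM_MODEL_DIR at a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        os.environ["SM_MODEL_DIR"] = tmp
        print(save_artifact(resolve_model_dir(), "model.bin", b"placeholder weights"))
```

Resolving the directory in one place keeps the rest of the script identical inside and outside SageMaker.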
SageMaker doesn't support argparse actions. If you want to use, for example, boolean hyperparameters, you need to specify type as bool in your script and provide an explicit True or False value for this hyperparameter [REF].
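The sketch below shows one way to accept an explicit True or False value without an argparse action. Note that it swaps in a small converter function instead of a literal type=bool: in plain Python, bool("False") is True, so type=bool would treat any non-empty string as true. The hyperparameter names --use_fp16 and --epochs and the helper str_to_bool are hypothetical, chosen only for illustration.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert an explicit 'True'/'False' string into a bool."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def parse_args(argv=None) -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    # No action="store_true": the value is always passed explicitly.
    parser.add_argument("--use_fp16", type=str_to_bool, default=False)
    parser.add_argument("--epochs", type=int, default=3)
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--use_fp16", "False", "--epochs", "5"])
    print(args.use_fp16, args.epochs)
```

The script is then called with, for example, --use_fp16 True rather than a bare --use_fp16 flag.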
Launch Training
You can launch your training with the Accelerate CLI by running accelerate launch followed by the path to your training script and its arguments.
This will launch your training script using your configuration. The only thing you have to do is provide all the arguments needed by your training script as named arguments.
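To illustrate what "named arguments" means here, the following sketch builds an accelerate launch command line from a dictionary of script arguments. The helper build_launch_command, the script name train.py and the argument names are assumptions made for this example; only the accelerate launch command itself comes from the CLI.

```python
import shlex
from typing import List, Mapping


def build_launch_command(script: str, script_args: Mapping[str, object]) -> List[str]:
    """Assemble `accelerate launch <script> --name value ...` as an argv list."""
    command = ["accelerate", "launch", script]
    for name, value in script_args.items():
        # Every script argument is passed by name, followed by its value.
        command += [f"--{name}", str(value)]
    return command


if __name__ == "__main__":
    cmd = build_launch_command("train.py", {"epochs": 3, "model_name": "bert-base-cased"})
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run, or the printed string pasted into a terminal.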
Examples
If you run one of the example scripts, don't forget to add accelerator.save('/opt/ml/model') to it.
Advanced Features
Distributed Training: Data Parallelism
Set up the Accelerate config by running accelerate config and answering the SageMaker questions. To use SageMaker DDP, select it when asked: What is the distributed mode? ([0] No distributed training, [1] data parallelism).
Distributed Training: Model Parallelism
Currently in development; will be supported soon.
Python packages and dependencies
Accelerate currently uses the BOINC AI DLCs, with transformers, datasets and tokenizers pre-installed. If you want to use different or additional Python packages, add them to the requirements.txt. These packages will be installed before your training script is started.
Local Training: SageMaker Local mode
The local mode in the SageMaker SDK allows you to run your training script locally inside the BOINC AI DLC (Deep Learning Container) or using your custom container image. This is useful for debugging and testing your training script inside the final container environment. Local mode uses Docker Compose (note: Docker Compose V2 is not supported yet). The SDK will handle the authentication against ECR to pull the DLC to your local environment. You can emulate CPU (single and multi-instance) and GPU (single instance) SageMaker training jobs.
To use local mode, you need to set your ec2_instance_type to local.
Advanced configuration
The configuration allows you to override parameters for the Estimator. These settings have to be applied in the config file and are not part of accelerate config. You can control many additional aspects of the training job, e.g. use Spot Instances, enable network isolation and many more.
You can find all available configuration here.
Use Spot Instances
You can use Spot Instances by setting the corresponding Estimator parameters in the config file (see Advanced configuration).
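As a rough sketch of the values involved, the helper below collects Spot-related Estimator overrides into a dictionary. The keys use_spot_instances, max_run and max_wait are parameter names from the SageMaker Python SDK Estimator; the function spot_estimator_overrides is hypothetical, and how these values are written into the Accelerate config file is not shown in this document, so treat the dictionary as an illustration only. The check reflects the SageMaker requirement that the maximum wait time is not shorter than the maximum run time.

```python
from typing import Any, Dict


def spot_estimator_overrides(max_run: int, max_wait: int) -> Dict[str, Any]:
    """Build Spot Instance overrides for an Estimator (times in seconds)."""
    # SageMaker rejects Spot jobs whose total wait time is shorter than
    # the allowed training run time, so fail early here.
    if max_wait < max_run:
        raise ValueError("max_wait must be greater than or equal to max_run")
    return {
        "use_spot_instances": True,
        "max_run": max_run,
        "max_wait": max_wait,
    }


if __name__ == "__main__":
    # One hour of training, up to two hours including waiting for capacity.
    print(spot_estimator_overrides(max_run=3600, max_wait=7200))
```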
Note: Spot Instances can be terminated at any time, in which case training has to be resumed from a checkpoint. This is not handled in Accelerate out of the box. Contact us if you would like this feature.
Remote scripts: Use scripts located on Github
It is undecided whether this feature is needed. Contact us if you would like this feature.