How to use BOINC AI Accelerate with SageMaker

Last updated 1 year ago

Amazon SageMaker

BOINC AI and Amazon introduced new BOINC AI Deep Learning Containers (DLCs) to make it easier than ever to train BOINC AI Transformer models in Amazon SageMaker.

Getting Started

Setup & Installation

Before you can run your Accelerate scripts on Amazon SageMaker, you need to sign up for an AWS account. If you do not have an AWS account yet, learn more here.

After you have your AWS account, you need to install the SageMaker SDK for Accelerate with:

pip install "accelerate[sagemaker]" --upgrade

๐ŸŒ Accelerate currently uses the ๐ŸŒ DLCs, with transformers, datasets and tokenizers pre-installed. ๐ŸŒ Accelerate is not in the DLC yet (will soon be added!) so to use it within Amazon SageMaker you need to create a requirements.txt in the same directory where your training script is located and add it as dependency:

accelerate

You should also add any other dependencies you have to this requirements.txt.

Configure Accelerate

You can configure the launch configuration for Amazon SageMaker the same way you do for non-SageMaker training jobs, with the Accelerate CLI:

accelerate config
# In which compute environment are you running? ([0] This machine, [1] AWS (Amazon SageMaker)): 1

๐ŸŒ Accelerate will go through a questionnaire about your Amazon SageMaker setup and create a config file you can edit.

๐ŸŒ Accelerate is not saving any of your credentials.

Prepare a ๐ŸŒ Accelerate fine-tuning script

The training script is very similar to a training script you might run outside of SageMaker, but to save your model after training you need to specify either /opt/ml/model or use os.environ["SM_MODEL_DIR"] as your save directory. After training, artifacts in this directory are uploaded to S3:

- torch.save('/opt/ml/model')
+ accelerator.save('/opt/ml/model')
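
The diff above is schematic. As a minimal sketch of resolving the save directory, the helper below reads SM_MODEL_DIR and falls back to /opt/ml/model when the variable is not set (for example when the script runs outside SageMaker). The function names get_model_dir and model_path and the file name model.pt are assumptions made for illustration; they are not part of Accelerate or the SageMaker SDK.

```python
import os


def get_model_dir(default: str = "/opt/ml/model") -> str:
    """Return the directory whose contents are uploaded to S3 after training.

    SM_MODEL_DIR is read from the environment; the default is used when the
    variable is unset (e.g. when running outside a SageMaker container).
    """
    return os.environ.get("SM_MODEL_DIR", default)


def model_path(filename: str = "model.pt") -> str:
    """Build the full path for a saved artifact (the file name is illustrative)."""
    return os.path.join(get_model_dir(), filename)


# In a training script, after training has finished, and assuming that
# `accelerator` and `model` already exist, the save call could look like:
#
#     state_dict = accelerator.unwrap_model(model).state_dict()
#     accelerator.save(state_dict, model_path())
```

Note that accelerator.save takes the object to save as its first argument and the destination as its second, so a real script passes both.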

Launch Training

You can launch your training with the Accelerate CLI:

accelerate launch path_to_script.py --args_to_the_script

This will launch your training script using your configuration. The only thing you have to do is provide all the arguments needed by your training script as named arguments.
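
Since the script receives its settings as named arguments, and SageMaker does not support argparse actions such as store_true, a training script can declare every option with an explicit value. The sketch below is one possible way to do this under stated assumptions: the argument names (--epochs, --learning_rate, --fp16) and the helper str_to_bool are hypothetical. The helper is used because Python's built-in bool() returns True for any non-empty string, including "False".

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert an explicit 'True'/'False' string into a bool (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Example training arguments")
    # Named arguments only; these names are illustrative.
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--learning_rate", type=float, default=5e-5)
    # No action="store_true": the value is passed explicitly, e.g. --fp16 True
    parser.add_argument("--fp16", type=str_to_bool, default=False)
    return parser.parse_args(argv)


if __name__ == "__main__":
    print(parse_args())
```

With these hypothetical arguments, the launch command would look like accelerate launch path_to_script.py --epochs 2 --fp16 True.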

Examples

If you run one of the example scripts, don't forget to add accelerator.save('/opt/ml/model') to it.

accelerate launch ./examples/sagemaker_example.py

Outputs:

Configuring Amazon SageMaker environment
Converting Arguments to Hyperparameters
Creating Estimator
2021-04-08 11:56:50 Starting - Starting the training job...
2021-04-08 11:57:13 Starting - Launching requested ML instancesProfilerReport-1617883008: InProgress
.........
2021-04-08 11:58:54 Starting - Preparing the instances for training.........
2021-04-08 12:00:24 Downloading - Downloading input data
2021-04-08 12:00:24 Training - Downloading the training image..................
2021-04-08 12:03:39 Training - Training image download completed. Training in progress..
........
epoch 0: {'accuracy': 0.7598039215686274, 'f1': 0.8178438661710037}
epoch 1: {'accuracy': 0.8357843137254902, 'f1': 0.882249560632689}
epoch 2: {'accuracy': 0.8406862745098039, 'f1': 0.8869565217391304}
........
2021-04-08 12:05:40 Uploading - Uploading generated training model
2021-04-08 12:05:40 Completed - Training job completed
Training seconds: 331
Billable seconds: 331
You can find your model data at: s3://your-bucket/accelerate-sagemaker-1-2021-04-08-11-56-47-108/output/model.tar.gz

Advanced Features

Distributed Training: Data Parallelism

Set up the Accelerate config by running accelerate config and answering the SageMaker questions. To use SageMaker DDP, select data parallelism when asked What is the distributed mode? ([0] No distributed training, [1] data parallelism):. Example config below:

base_job_name: accelerate-sagemaker-1
compute_environment: AMAZON_SAGEMAKER
distributed_type: DATA_PARALLEL
ec2_instance_type: ml.p3.16xlarge
iam_role_name: xxxxx
image_uri: null
mixed_precision: fp16
num_machines: 1
profile: xxxxx
py_version: py38
pytorch_version: 1.10.2
region: us-east-1
transformers_version: 4.17.0
use_cpu: false

Distributed Training: Model Parallelism

Currently in development; will be supported soon.

Python packages and dependencies

๐ŸŒ Accelerate currently uses the ๐ŸŒ DLCs, with transformers, datasets and tokenizers pre-installed. If you want to use different/other Python packages you can do this by adding them to the requirements.txt. These packages will be installed before your training script is started.

Local Training: SageMaker Local mode

The local mode in the SageMaker SDK allows you to run your training script locally inside the BOINC AI DLC (Deep Learning Container) or using your custom container image. This is useful for debugging and testing your training script inside the final container environment. Local mode uses Docker Compose (Note: Docker Compose V2 is not supported yet). The SDK will handle the authentication against ECR to pull the DLC to your local environment. You can emulate CPU (single and multi-instance) and GPU (single instance) SageMaker training jobs.

To use local mode, you need to set your ec2_instance_type to local.

ec2_instance_type: local

Advanced configuration

additional_args:
  # enable network isolation to restrict internet access for containers
  enable_network_isolation: True

Use Spot Instances

additional_args:
  use_spot_instances: True
  max_wait: 86400

Note: Spot Instances can be terminated at any time, in which case training has to be resumed from a checkpoint. Accelerate does not handle this out of the box. Contact us if you would like this feature.
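
Because resuming is left to the training script, one possible approach is to look for the most recent checkpoint when the script starts. The sketch below is illustrative only and rests on several assumptions: checkpoints are written to /opt/ml/checkpoints, each one is a sub-directory named checkpoint-<step>, and the training job is configured so that this directory is synced to S3 (for example through the Estimator's checkpoint_s3_uri parameter). The function name latest_checkpoint is hypothetical.

```python
import os
import re
from typing import Optional

# Assumption: checkpoints live here as sub-directories named "checkpoint-<step>".
CHECKPOINT_DIR = "/opt/ml/checkpoints"
_CHECKPOINT_NAME = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(root: str = CHECKPOINT_DIR) -> Optional[str]:
    """Return the path of the highest-numbered checkpoint directory, or None."""
    if not os.path.isdir(root):
        return None
    best_step, best_path = -1, None
    for name in os.listdir(root):
        match = _CHECKPOINT_NAME.match(name)
        path = os.path.join(root, name)
        # Keep only matching directories and remember the largest step seen.
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

In a training script, a non-None result could be passed to accelerator.load_state before the training loop, with accelerator.save_state writing a new checkpoint-<step> directory at regular intervals.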

Remote scripts: Use scripts located on GitHub

Undecided whether this feature is needed. Contact us if you would like this feature.

SageMaker doesn't support argparse actions. If you want to use, for example, boolean hyperparameters, you need to specify type as bool in your script and provide an explicit True or False value for this hyperparameter.

The additional_args section of the config file (see Advanced configuration and Use Spot Instances above) allows you to override parameters for the Estimator. These settings have to be applied in the config file and are not part of accelerate config. You can control many additional aspects of the training job, e.g. use Spot Instances or enable network isolation. All available options are listed in the Estimator documentation.