Launching distributed training from Jupyter Notebooks

Launching Multi-GPU Training from a Jupyter Environment

This tutorial teaches you how to fine tune a computer vision model with ๐ŸŒ Accelerate from a Jupyter Notebook on a distributed system. You will also learn how to setup a few requirements needed for ensuring your environment is configured properly, your data has been prepared properly, and finally how to launch training.

This tutorial is also available as a Jupyter Notebook here

Configuring the Environment

Before any training can be performed, a ๐ŸŒ Accelerate config file must exist in the system. Usually this can be done by running the following in a terminal and answering the prompts:

Copied

accelerate config

However, if general defaults are fine and you are not running on a TPU, ๐ŸŒAccelerate has a utility to quickly write your GPU configuration into a config file via utils.write_basic_config().

The following code will restart Jupyter after writing the configuration, as CUDA code was called to perform this.

CUDA canโ€™t be initialized more than once on a multi-GPU system. Itโ€™s fine to debug in the notebook and have calls to CUDA, but in order to finally train a full cleanup and restart will need to be performed.

Copied

import os
from accelerate.utils import write_basic_config

write_basic_config()  # Write a config file
os._exit(00)  # Restart the notebook

Preparing the Dataset and Model

Next you should prepare your dataset. As mentioned at earlier, great care should be taken when preparing the DataLoaders and model to make sure that nothing is put on any GPU.

If you do, it is recommended to put that specific code into a function and call that from within the notebook launcher interface, which will be shown later.

Make sure the dataset is downloaded based on the directions here

Copied

import os, re, torch, PIL
import numpy as np

from torch.optim.lr_scheduler import OneCycleLR
from torch.utils.data import DataLoader, Dataset
from torchvision.transforms import Compose, RandomResizedCrop, Resize, ToTensor

from accelerate import Accelerator
from accelerate.utils import set_seed
from timm import create_model

First you need to create a function to extract the class name based on a filename:

Copied

import os

data_dir = "../../images"
fnames = os.listdir(data_dir)
fname = fnames[0]
print(fname)

Copied

beagle_32.jpg

In the case here, the label is beagle. Using regex you can extract the label from the filename:

Copied

import re


def extract_label(fname):
    stem = fname.split(os.path.sep)[-1]
    return re.search(r"^(.*)_\d+\.jpg$", stem).groups()[0]

Copied

extract_label(fname)

And you can see it properly returned the right name for our file:

Copied

"beagle"

Next a Dataset class should be made to handle grabbing the image and the label:

Copied

class PetsDataset(Dataset):
    def __init__(self, file_names, image_transform=None, label_to_id=None):
        self.file_names = file_names
        self.image_transform = image_transform
        self.label_to_id = label_to_id

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, idx):
        fname = self.file_names[idx]
        raw_image = PIL.Image.open(fname)
        image = raw_image.convert("RGB")
        if self.image_transform is not None:
            image = self.image_transform(image)
        label = extract_label(fname)
        if self.label_to_id is not None:
            label = self.label_to_id[label]
        return {"image": image, "label": label}

Now to build the dataset. Outside the training function you can find and declare all the filenames and labels and use them as references inside the launched function:

Copied

fnames = [os.path.join("../../images", fname) for fname in fnames if fname.endswith(".jpg")]

Next gather all the labels:

Copied

all_labels = [extract_label(fname) for fname in fnames]
id_to_label = list(set(all_labels))
id_to_label.sort()
label_to_id = {lbl: i for i, lbl in enumerate(id_to_label)}

Next, you should make a get_dataloaders function that will return your built dataloaders for you. As mentioned earlier, if data is automatically sent to the GPU or a TPU device when building your DataLoaders, they must be built using this method.

Copied

def get_dataloaders(batch_size: int = 64):
    "Builds a set of dataloaders with a batch_size"
    random_perm = np.random.permutation(len(fnames))
    cut = int(0.8 * len(fnames))
    train_split = random_perm[:cut]
    eval_split = random_perm[cut:]

    # For training a simple RandomResizedCrop will be used
    train_tfm = Compose([RandomResizedCrop((224, 224), scale=(0.5, 1.0)), ToTensor()])
    train_dataset = PetsDataset([fnames[i] for i in train_split], image_transform=train_tfm, label_to_id=label_to_id)

    # For evaluation a deterministic Resize will be used
    eval_tfm = Compose([Resize((224, 224)), ToTensor()])
    eval_dataset = PetsDataset([fnames[i] for i in eval_split], image_transform=eval_tfm, label_to_id=label_to_id)

    # Instantiate dataloaders
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size, num_workers=4)
    eval_dataloader = DataLoader(eval_dataset, shuffle=False, batch_size=batch_size * 2, num_workers=4)
    return train_dataloader, eval_dataloader

Finally, you should import the scheduler to be used later:

Copied

from torch.optim.lr_scheduler import CosineAnnealingLR

Writing the Training Function

Now you can build the training loop. notebook_launcher() works by passing in a function to call that will be ran across the distributed system.

Here is a basic training loop for the animal classification problem:

The code has been split up to allow for explainations on each section. A full version that can be copy and pasted will be available at the end

Copied

def training_loop(mixed_precision="fp16", seed: int = 42, batch_size: int = 64):
    set_seed(seed)
    accelerator = Accelerator(mixed_precision=mixed_precision)

First you should set the seed and create an Accelerator object as early in the training loop as possible.

If training on the TPU, your training loop should take in the model as a parameter and it should be instantiated outside of the training loop function. See the TPU best practices to learn why

Next you should build your dataloaders and create your model:

Copied

    train_dataloader, eval_dataloader = get_dataloaders(batch_size)
    model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))

You build the model here so that the seed also controls the new weight initialization

As you are performing transfer learning in this example, the encoder of the model starts out frozen so the head of the model can be trained only initially:

Copied

    for param in model.parameters():
        param.requires_grad = False
    for param in model.get_classifier().parameters():
        param.requires_grad = True

Normalizing the batches of images will make training a little faster:

Copied

    mean = torch.tensor(model.default_cfg["mean"])[None, :, None, None]
    std = torch.tensor(model.default_cfg["std"])[None, :, None, None]

To make these constants available on the active device, you should set it to the Acceleratorโ€™s device:

Copied

    mean = mean.to(accelerator.device)
    std = std.to(accelerator.device)

Next instantiate the rest of the PyTorch classes used for training:

Copied

    optimizer = torch.optim.Adam(params=model.parameters(), lr=3e-2 / 25)
    lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=3e-2, epochs=5, steps_per_epoch=len(train_dataloader))

Before passing everything to prepare().

There is no specific order to remember, you just need to unpack the objects in the same order you gave them to the prepare method.

Copied

    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

Now train the model:

Copied

    for epoch in range(5):
        model.train()
        for batch in train_dataloader:
            inputs = (batch["image"] - mean) / std
            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, batch["label"])
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

The evaluation loop will look slightly different compared to the training loop. The number of elements passed as well as the overall total accuracy of each batch will be added to two constants:

Copied

        model.eval()
        accurate = 0
        num_elems = 0

Next you have the rest of your standard PyTorch loop:

Copied

        for batch in eval_dataloader:
            inputs = (batch["image"] - mean) / std
            with torch.no_grad():
                outputs = model(inputs)
            predictions = outputs.argmax(dim=-1)

Before finally the last major difference.

When performing distributed evaluation, the predictions and labels need to be passed through gather() so that all of the data is available on the current device and a properly calculated metric can be achieved:

Copied

            accurate_preds = accelerator.gather(predictions) == accelerator.gather(batch["label"])
            num_elems += accurate_preds.shape[0]
            accurate += accurate_preds.long().sum()

Now you just need to calculate the actual metric for this problem, and you can print it on the main process using print():

Copied

        eval_metric = accurate.item() / num_elems
        accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}")

A full version of this training loop is available below:

Copied

def training_loop(mixed_precision="fp16", seed: int = 42, batch_size: int = 64):
    set_seed(seed)
    # Initialize accelerator
    accelerator = Accelerator(mixed_precision=mixed_precision)
    # Build dataloaders
    train_dataloader, eval_dataloader = get_dataloaders(batch_size)

    # Instantiate the model (you build the model here so that the seed also controls new weight initaliziations)
    model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))

    # Freeze the base model
    for param in model.parameters():
        param.requires_grad = False
    for param in model.get_classifier().parameters():
        param.requires_grad = True

    # You can normalize the batches of images to be a bit faster
    mean = torch.tensor(model.default_cfg["mean"])[None, :, None, None]
    std = torch.tensor(model.default_cfg["std"])[None, :, None, None]

    # To make these constants available on the active device, set it to the accelerator device
    mean = mean.to(accelerator.device)
    std = std.to(accelerator.device)

    # Intantiate the optimizer
    optimizer = torch.optim.Adam(params=model.parameters(), lr=3e-2 / 25)

    # Instantiate the learning rate scheduler
    lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=3e-2, epochs=5, steps_per_epoch=len(train_dataloader))

    # Prepare everything
    # There is no specific order to remember, you just need to unpack the objects in the same order you gave them to the
    # prepare method.
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

    # Now you train the model
    for epoch in range(5):
        model.train()
        for batch in train_dataloader:
            inputs = (batch["image"] - mean) / std
            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, batch["label"])
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        model.eval()
        accurate = 0
        num_elems = 0
        for batch in eval_dataloader:
            inputs = (batch["image"] - mean) / std
            with torch.no_grad():
                outputs = model(inputs)
            predictions = outputs.argmax(dim=-1)
            accurate_preds = accelerator.gather(predictions) == accelerator.gather(batch["label"])
            num_elems += accurate_preds.shape[0]
            accurate += accurate_preds.long().sum()

        eval_metric = accurate.item() / num_elems
        # Use accelerator.print to print only on the main process.
        accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}")

Using the notebook_launcher

All thatโ€™s left is to use the notebook_launcher().

You pass in the function, the arguments (as a tuple), and the number of processes to train on. (See the documentation for more information)

Copied

from accelerate import notebook_launcher

Copied

args = ("fp16", 42, 64)
notebook_launcher(training_loop, args, num_processes=2)

In the case of running on multiple nodes, you need to set up a Jupyter session at each node and run the launching cell at the same time.

For an environment containing 2 nodes (computers) with 8 GPUs each and the main computer with an IP address of โ€œ172.31.43.8โ€, it would look like so:

Copied

notebook_launcher(training_loop, args, master_addr="172.31.43.8", node_rank=0, num_nodes=2, num_processes=8)

And in the second Jupyter session on the other machine:

Notice how the node_rank has changed

Copied

notebook_launcher(training_loop, args, master_addr="172.31.43.8", node_rank=1, num_nodes=2, num_processes=8)

In the case of running on the TPU, it would look like so:

Copied

model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))

args = (model, "fp16", 42, 64)
notebook_launcher(training_loop, args, num_processes=8)

As itโ€™s running it will print the progress as well as state how many devices you ran on. This tutorial was ran with two GPUs:

Copied

Launching training on 2 GPUs.
epoch 0: 88.12
epoch 1: 91.73
epoch 2: 92.58
epoch 3: 93.90
epoch 4: 94.71

And thatโ€™s it!

Debugging

A common issue when running the notebook_launcher is receiving a CUDA has already been initialized issue. This usually stems from an import or prior code in the notebook that makes a call to the PyTorch torch.cuda sublibrary. To help narrow down what went wrong, you can launch the notebook_launcher with ACCELERATE_DEBUG_MODE=yes in your environment and an additional check will be made when spawning that a regular process can be created and utilize CUDA without issue. (Your CUDA code can still be ran afterwards).

Conclusion

This notebook showed how to perform distributed training from inside of a Jupyter Notebook. Some key notes to remember:

  • Make sure to save any code that use CUDA (or CUDA imports) for the function passed to notebook_launcher()

  • Set the num_processes to be the number of devices used for training (such as number of GPUs, CPUs, TPUs, etc)

  • If using the TPU, declare your model outside the training loop function

Last updated