How to use Megatron-LM

Megatron-LM

Megatron-LM enables training large transformer language models at scale. It provides efficient tensor, pipeline and sequence based model parallelism for pre-training transformer based language models such as GPT (decoder only), BERT (encoder only) and T5 (encoder-decoder). For detailed information and how things work behind the scenes, please refer to the GitHub repo.

What is integrated?

Accelerate integrates the following features of Megatron-LM to enable large scale pre-training/finetuning of BERT (Encoder), GPT (Decoder) or T5 models (Encoder and Decoder):

a. Tensor Parallelism (TP): Reduces memory footprint without much additional communication on intra-node ranks. Each tensor is split into multiple chunks, with each shard residing on a separate GPU. At each step, the same mini-batch of data is processed independently and in parallel by each shard, followed by syncing across all GPUs (all-reduce operation). In a simple transformer layer, this leads to 2 all-reduces in the forward path and 2 in the backward path. For more details, please refer to the research paper Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism and this section of the blog post The Technology Behind BLOOM Training.

b. Pipeline Parallelism (PP): Reduces memory footprint and enables large scale training via inter-node parallelization. Reduces the bubble of naive PP via the PipeDream-Flush (1F1B) schedule and the Interleaved 1F1B schedule. Layers are distributed uniformly across PP stages. For example, if a model has 24 layers and we have 4 GPUs for pipeline parallelism, each GPU will have 6 layers (24/4). For more details on schedules to reduce the idle time of PP, please refer to the research paper Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM and this section of the blog post The Technology Behind BLOOM Training.

c. Sequence Parallelism (SP): Reduces memory footprint without any additional communication. Only applicable when using TP. It reduces the activation memory required because it avoids keeping identical copies of the activations on every tensor parallel rank after the all-reduce: the all-reduce is replaced with a reduce-scatter, and the operation that was previously a no-op is replaced with an all-gather. As all-reduce = reduce-scatter + all-gather, this saves a large amount of activation memory at no added communication cost. To put it simply, it shards the outputs of each transformer layer along the sequence dimension, e.g., if the sequence length is 1024 and the TP size is 4, each GPU will have 256 tokens (1024/4) for each sample. This increases the batch size that can be supported for training. For more details, please refer to the research paper Reducing Activation Recomputation in Large Transformer Models.

d. Data Parallelism (DP) via Distributed Optimizer: Reduces the memory footprint by sharding optimizer states and gradients across DP ranks (versus the traditional method of replicating the optimizer state across data parallel ranks). For example, when using the Adam optimizer with mixed-precision training, each parameter accounts for 12 bytes of memory. This gets distributed equally across the GPUs, i.e., each parameter would account for 3 bytes (12/4) if we have 4 GPUs. For more details, please refer to the research paper ZeRO: Memory Optimizations Toward Training Trillion Parameter Models and the following section of the blog post The Technology Behind BLOOM Training.

e. Selective Activation Recomputation: Reduces the memory footprint of activations significantly via smart activation checkpointing. It doesn't store activations that occupy a large amount of memory but are fast to recompute, thereby achieving a good tradeoff between memory and recomputation. For example, for GPT-3, this leads to a 70% reduction in the memory required for activations at the expense of only 2.7% FLOPs overhead for recomputation of activations. For more details, please refer to the research paper Reducing Activation Recomputation in Large Transformer Models.

f. Fused Kernels: Fused Softmax, Mixed Precision Fused Layer Norm, and gradient accumulation fused into the weight gradient computation of linear layers. PyTorch JIT compiled Fused GeLU and Fused Bias+Dropout+Residual addition.

g. Support for Indexed datasets: Efficient binary format of datasets for large scale training. Support for the mmap, cached index file and lazy loader formats.

h. Checkpoint reshaping and interoperability: Utility for reshaping Megatron-LM checkpoints of variable tensor and pipeline parallel sizes into Transformers sharded checkpoints, which are well supported by a plethora of tools such as Accelerate Big Model Inference, Megatron-DeepSpeed Inference etc. Support is also available for converting Transformers sharded checkpoints to Megatron-LM checkpoints of variable tensor and pipeline parallel sizes for large scale training.
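The sharding arithmetic quoted in items b, c and d above can be checked with a few lines of plain Python. This is only an illustrative sketch: the helper names are hypothetical and are not part of Megatron-LM or Accelerate, and the collectives are simulated on Python lists rather than run on GPUs. The last three helpers illustrate the identity all-reduce = reduce-scatter + all-gather that sequence parallelism relies on.

```python
def layers_per_stage(num_layers, pp_size):
    """Layers held by each pipeline stage when layers are split uniformly."""
    if num_layers % pp_size != 0:
        raise ValueError("num_layers must be divisible by pp_size")
    return num_layers // pp_size


def tokens_per_tp_rank(seq_length, tp_size):
    """Tokens of each sample held by one TP rank under sequence parallelism."""
    if seq_length % tp_size != 0:
        raise ValueError("seq_length must be divisible by tp_size")
    return seq_length // tp_size


def optimizer_bytes_per_param(bytes_per_param, dp_size):
    """Per-GPU bytes per parameter when the state is sharded over DP ranks."""
    return bytes_per_param / dp_size


def all_reduce(per_rank):
    """Every rank ends up with the full element-wise sum."""
    total = [sum(column) for column in zip(*per_rank)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank):
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    world_size = len(per_rank)
    chunk = len(per_rank[0]) // world_size
    total = [sum(column) for column in zip(*per_rank)]
    return [total[r * chunk:(r + 1) * chunk] for r in range(world_size)]


def all_gather(shards):
    """Every rank ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


print(layers_per_stage(24, 4))           # 6
print(tokens_per_tp_rank(1024, 4))       # 256
print(optimizer_bytes_per_param(12, 4))  # 3.0

# Two simulated ranks, each holding a vector of length 4.
inputs = [[1, 2, 3, 4], [10, 20, 30, 40]]
print(all_gather(reduce_scatter(inputs)) == all_reduce(inputs))  # True
```

Between the reduce-scatter and the all-gather, each simulated rank holds only its own chunk of the summed vector, which is where the activation memory saving comes from.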

Pre-Requisites

You will need to install the latest PyTorch, CUDA, NCCL, and NVIDIA APEX releases and the nltk library. See the documentation for more details. Another way to set up the environment is to pull an NVIDIA PyTorch Container from NGC that comes with all the required installations.

Below is a step-by-step method to set up the conda environment:

  1. Create a virtual environment


  2. Assuming that the machine has CUDA 11.3 installed, install the corresponding PyTorch GPU version


  3. Install NVIDIA APEX


  4. Install Megatron-LM


Accelerate Megatron-LM Plugin

Important features are directly supported via the accelerate config command. An example of the corresponding questions for using Megatron-LM features is shown below:


The resulting config is shown below:


We will take the example of GPT pre-training. The minimal changes required to the official run_clm_no_trainer.py to use Megatron-LM are as follows:

  1. As Megatron-LM uses its own implementation of the optimizer, a scheduler compatible with it needs to be used. As such, only Megatron-LM's scheduler is supported. The user will need to create an accelerate.utils.MegatronLMDummyScheduler. An example is given below:


  2. Getting the details of the total batch size now needs to be cognizant of the tensor and pipeline parallel sizes. An example of getting the effective total batch size is shown below:


  3. When using Megatron-LM, the losses are already averaged across the data parallel group


  4. For Megatron-LM, we need to save the model using accelerator.save_state
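For the batch-size step above, the relationship between the parallel sizes and the effective total batch size can be sketched in plain Python. The sketch assumes the usual Megatron-LM convention that the data parallel size is the world size divided by the product of the tensor and pipeline parallel sizes, and that one optimizer step consumes micro_batch_size * num_micro_batches samples on each data parallel rank. The function and parameter names are hypothetical, and this is not the code used by the example script.

```python
def data_parallel_size(world_size, tp_size, pp_size):
    """GPUs left for data parallelism after TP and PP claim their share."""
    model_parallel_size = tp_size * pp_size
    if world_size % model_parallel_size != 0:
        raise ValueError("world_size must be divisible by tp_size * pp_size")
    return world_size // model_parallel_size


def effective_batch_size(micro_batch_size, num_micro_batches,
                         world_size, tp_size, pp_size):
    """Samples consumed per optimizer step across the whole job."""
    dp_size = data_parallel_size(world_size, tp_size, pp_size)
    return micro_batch_size * num_micro_batches * dp_size


# 4 GPUs with TP=2 and PP=2 leave a data parallel size of 1.
print(effective_batch_size(4, 2, world_size=4, tp_size=2, pp_size=2))  # 8
# Doubling the GPU count at the same TP/PP doubles the data parallel size.
print(effective_batch_size(4, 2, world_size=8, tp_size=2, pp_size=2))  # 16
```

Note that simply multiplying the per-device batch size by the number of processes would overcount here, because the GPUs inside one tensor or pipeline parallel group work on the same samples.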


That's it! We are good to go 🚀. Please find the example script in the examples folder at the path accelerate/examples/by_feature/megatron_lm_gpt_pretraining.py. Let's run it for the gpt-large model architecture using 4 A100-80GB GPUs.


Below are some important excerpts from the output logs:


There are a large number of other options/features that one can set using accelerate.utils.MegatronLMPlugin.

Advanced features: writing a custom train step and using Megatron-LM Indexed Datasets

To leverage more features, see the details below.

  1. Below is an example of the changes required to customize the train step while using Megatron-LM. You will implement accelerate.utils.AbstractTrainStep or inherit from one of its children: accelerate.utils.GPTTrainStep, accelerate.utils.BertTrainStep or accelerate.utils.T5TrainStep.


  2. Using the Megatron-LM datasets requires a few more changes. Dataloaders for these datasets are available only on rank 0 of each tensor parallel group. As such, there are ranks where the dataloader won't be available, and this requires tweaks to the training loop. Being able to do all this shows how flexible and extensible Accelerate is. The changes required are as follows.

a. For Megatron-LM indexed datasets, we need to use MegatronLMDummyDataLoader and pass the required dataset args to it such as data_path, seq_length etc. See here for the list of available args.


b. megatron_dataloader is repeated 3 times to get training, validation and test dataloaders as per the args.splits_string proportions


c. Changes to the training and evaluation loops, as the dataloader is only available on tensor parallel rank 0. We need to iterate only if the dataloader isn't None, and otherwise provide an empty dict. As such, we loop using a while loop and break when completed_steps is equal to args.max_train_steps. This is similar to the Megatron-LM setup, wherein the user has to provide max_train_steps when using Megatron-LM indexed datasets. This displays how flexible and extensible Accelerate is.
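The loop structure described in item c can be sketched without any Megatron-LM dependency. In this illustration, run_training, train_step and the list standing in for a dataloader are hypothetical stand-ins; in a real script the step would be the prepared model's forward and backward pass.

```python
def run_training(train_dataloader, train_step, max_train_steps):
    """Step-bounded loop that also works on ranks without a dataloader."""
    # Only tensor parallel rank 0 has a dataloader; elsewhere it is None.
    data_iter = iter(train_dataloader) if train_dataloader is not None else None
    completed_steps = 0
    seen_batches = []
    while completed_steps < max_train_steps:
        # Ranks without a dataloader pass an empty dict instead of a batch.
        batch = next(data_iter) if data_iter is not None else {}
        train_step(batch)
        seen_batches.append(batch)
        completed_steps += 1
    return completed_steps, seen_batches


def train_step(batch):
    """Placeholder for the real forward/backward/optimizer step."""
    return None


# The stand-in dataloader must supply at least max_train_steps batches.
batches = [{"input_ids": [1, 2]}, {"input_ids": [3, 4]}, {"input_ids": [5, 6]}]

steps_with_data, seen_with_data = run_training(batches, train_step, max_train_steps=2)
steps_without, seen_without = run_training(None, train_step, max_train_steps=2)

print(steps_with_data, seen_with_data)
print(steps_without, seen_without)
```

Bounding the loop by a step count rather than by dataloader length keeps every rank in lockstep, including the ranks that never see real data.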


Utility for Checkpoint reshaping and interoperability

  1. The scripts for these are present in the Transformers library under the respective models. Currently, it is available for the GPT model: checkpoint_reshaping_and_interoperability.py

  2. Below is an example of converting a checkpoint from Megatron-LM to a universal Transformers sharded checkpoint.


  3. Conversion of a checkpoint from Transformers to Megatron-LM with tp_size=2, pp_size=2 and dp_size=2.


Megatron-LM GPT models support returning logits and megatron_generate function for text generation

  1. Returning logits requires setting return_logits=True in MegatronLMPlugin as shown below. The logits are available in the last stage of the pipeline.


  2. megatron_generate method for Megatron-LM GPT model: This uses tensor and pipeline parallelism to complete generations for a batch of inputs when using greedy decoding with or without top_k/top_p sampling, and for individual prompt inputs when using beam search decoding. Only a subset of the features of Transformers generate is supported. This helps in using large models via tensor and pipeline parallelism for generation (it already does key-value caching and uses fused kernels by default). It requires the data parallel size to be 1, and sequence parallelism and activation checkpointing to be disabled. It also requires specifying the path to the tokenizer's vocab file and merges file. The example below shows how to configure and use the megatron_generate method for the Megatron-LM GPT model.


  3. An end-to-end example of using the megatron_generate method for the Megatron-LM GPT model is available at megatron_gpt2_generation.py with config file megatron_lm_gpt_generate_config.yaml. The bash script with the accelerate launch command is available at megatron_lm_gpt_generate.sh. The output logs of the script are available at megatron_lm_gpt_generate.log.

Support for ROPE and ALiBi Positional embeddings and Multi-Query Attention

  1. For ROPE/ALiBi attention, pass position_embedding_type with ("absolute" | "rotary" | "alibi") to MegatronLMPlugin as shown below.


  2. For Multi-Query Attention, pass attention_head_type with ("multihead" | "multiquery") to MegatronLMPlugin as shown below.


Caveats

  1. Supports Transformers GPT2, Megatron-BERT and T5 models. This covers the decoder only, encoder only and encoder-decoder model classes.

  2. Only the loss is returned from the model forward pass, as there is a quite complex interplay of pipeline, tensor and data parallelism behind the scenes. The model(**batch_data) call returns loss(es) averaged across the data parallel ranks. This is fine for most cases wherein pre-training jobs are run using Megatron-LM features, and you can easily compute the perplexity using the loss. For the GPT model, returning logits in addition to loss(es) is supported. These logits aren't gathered across data parallel ranks. Use accelerator.utils.gather_across_data_parallel_groups to gather logits across data parallel ranks. These logits along with labels can be used for computing various performance metrics.

  3. The main process is the last rank, as the losses/logits are available in the last stage of the pipeline. accelerator.is_main_process and accelerator.is_local_main_process return True for the last rank when using the Megatron-LM integration.

  4. In the accelerator.prepare call, a Megatron-LM model corresponding to a given Transformers model is created with random weights. Please use accelerator.load_state to load the Megatron-LM checkpoint with matching TP, PP and DP partitions.

  5. Currently, checkpoint reshaping and interoperability support is only available for GPT. Soon it will be extended to BERT and T5.

  6. gradient_accumulation_steps needs to be 1. When using Megatron-LM, micro batches in the pipeline parallelism setting are synonymous with gradient accumulation.

  7. When using Megatron-LM, use accelerator.save_state and accelerator.load_state for saving and loading checkpoints.

  8. Below is the mapping from Megatron-LM model architectures to the equivalent Transformers model architectures. Only these Transformers model architectures are supported.

a. Megatron-LM BertModel : Transformers models with megatron-bert in config's model type, e.g., MegatronBERT

b. Megatron-LM GPTModel : Transformers models with gpt2 in config's model type, e.g., OpenAI GPT2

c. Megatron-LM T5Model : Transformers models with t5 in config's model type, e.g., T5 and MT5
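Caveat 2 notes that perplexity can be computed from the returned loss. A minimal sketch, assuming the losses are mean per-token cross-entropy values (in nats) collected over evaluation steps; the function name is hypothetical.

```python
import math


def perplexity_from_losses(losses):
    """Perplexity is the exponential of the mean cross-entropy loss."""
    if not losses:
        raise ValueError("losses must not be empty")
    mean_loss = sum(losses) / len(losses)
    try:
        return math.exp(mean_loss)
    except OverflowError:
        # A very large loss overflows exp(); report it as infinite perplexity.
        return float("inf")


print(perplexity_from_losses([0.0, 0.0]))  # 1.0
print(perplexity_from_losses([1000.0]))    # inf
```

Because the loss is already averaged across the data parallel ranks, no extra gather is needed before taking the exponential.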
