How to use Megatron-LM
Megatron-LM
Megatron-LM enables training large transformer language models at scale. It provides efficient tensor, pipeline and sequence based model parallelism for pre-training transformer based language models such as GPT (Decoder Only), BERT (Encoder Only) and T5 (Encoder-Decoder). For detailed information and how things work behind the scenes, please refer to the GitHub repo.
What is integrated?
Accelerate integrates the following features of Megatron-LM to enable large scale pre-training/finetuning of BERT (Encoder), GPT (Decoder) or T5 models (Encoder and Decoder):
a. Tensor Parallelism (TP): Reduces memory footprint without much additional communication on intra-node ranks. Each tensor is split into multiple chunks, with each shard residing on a separate GPU. At each step, the same mini-batch of data is processed independently and in parallel by each shard, followed by syncing across all GPUs (all-reduce operation). In a simple transformer layer, this leads to 2 all-reduces in the forward path and 2 in the backward path. For more details, please refer to the research paper Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism and this section of the 🤗 blog post The Technology Behind BLOOM Training.
b. Pipeline Parallelism (PP): Reduces memory footprint and enables large scale training via inter-node parallelization. Reduces the bubble of naive PP via the PipeDream-Flush schedule/1F1B schedule and the Interleaved 1F1B schedule. Layers are distributed uniformly across PP stages. For example, if a model has 24 layers and we have 4 GPUs for pipeline parallelism, each GPU will have 6 layers (24/4). For more details on schedules to reduce the idle time of PP, please refer to the research paper Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM and this section of the 🤗 blog post The Technology Behind BLOOM Training.
c. Sequence Parallelism (SP): Reduces memory footprint without any additional communication. Only applicable when using TP. It reduces the activation memory required by preventing duplicate copies on the tensor parallel ranks after the all-reduce: the all-reduce is replaced with a reduce-scatter, and the no-op operation is replaced with an all-gather. As all-reduce = reduce-scatter + all-gather, this saves a ton of activation memory at no added communication cost. To put it simply, it shards the outputs of each transformer layer along the sequence dimension, e.g., if the sequence length is 1024 and the TP size is 4, each GPU will have 256 tokens (1024/4) for each sample. This increases the batch size that can be supported for training. For more details, please refer to the research paper Reducing Activation Recomputation in Large Transformer Models.
d. Data Parallelism (DP) via Distributed Optimizer: Reduces the memory footprint by sharding optimizer states and gradients across DP ranks (versus the traditional method of replicating the optimizer state across data parallel ranks). For example, when using the Adam optimizer with mixed-precision training, each parameter accounts for 12 bytes of memory. This gets distributed equally across the GPUs, i.e., each parameter would account for 3 bytes (12/4) if we have 4 GPUs. For more details, please refer to the research paper ZeRO: Memory Optimizations Toward Training Trillion Parameter Models and this section of the 🤗 blog post The Technology Behind BLOOM Training.
e. Selective Activation Recomputation: Reduces the memory footprint of activations significantly via smart activation checkpointing. It doesn't store activations occupying large memory while being fast to recompute, thereby achieving a great tradeoff between memory and recomputation. For example, for GPT-3, this leads to a 70% reduction in required memory for activations at the expense of only a 2.7% FLOPs overhead for recomputation of activations. For more details, please refer to the research paper Reducing Activation Recomputation in Large Transformer Models.
f. Fused Kernels: Fused Softmax, Mixed Precision Fused Layer Norm and fused gradient accumulation into the weight gradient computation of the linear layer, as well as PyTorch JIT compiled Fused GeLU and Fused Bias+Dropout+Residual addition.
g. Support for Indexed datasets: Efficient binary format of datasets for large scale training. Support for the mmap, cached index file and the lazy loader format.
h. Checkpoint reshaping and interoperability: Utility for reshaping Megatron-LM checkpoints of variable tensor and pipeline parallel sizes to 🤗 Transformers sharded checkpoints, which have great support with a plethora of tools such as 🤗 Accelerate Big Model Inference, Megatron-DeepSpeed Inference, etc. Support is also available for converting 🤗 Transformers sharded checkpoints to Megatron-LM checkpoints of variable tensor and pipeline parallel sizes for large scale training.
Pre-Requisites
You will need to install the latest pytorch, cuda, nccl, and NVIDIA APEX releases and the nltk library. See the documentation for more details. Another way to set up the environment is to pull an NVIDIA PyTorch Container that comes with all the required installations from NGC.
Below is a step-by-step method to set up the conda environment:
Create a virtual environment
Assuming that the machine has CUDA 11.3 installed, install the corresponding PyTorch GPU version
Install Nvidia APEX
Install Megatron-LM
Accelerate Megatron-LM Plugin
Important features are directly supported via the accelerate config command. An example of the corresponding questions for using Megatron-LM features is shown below:
The resulting config is shown below:
We will take the example of GPT pre-training. The minimal changes required to the official run_clm_no_trainer.py to use Megatron-LM are as follows:
As Megatron-LM uses its own implementation of the optimizer, the corresponding scheduler compatible with it needs to be used. As such, support for only Megatron-LM's scheduler is present. The user will need to create accelerate.utils.MegatronLMDummyScheduler. An example is given below:
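The following is a minimal sketch rather than the verbatim script; it assumes the variables already defined in run_clm_no_trainer.py (accelerator, optimizer, args) and the get_scheduler helper from 🤗 Transformers, and it falls back to the regular scheduler when the Megatron-LM integration is not active:

```python
from accelerate.utils import DistributedType, MegatronLMDummyScheduler
from transformers import get_scheduler

# Use the Megatron-LM compatible dummy scheduler when the integration is active;
# the actual schedule is then driven by Megatron-LM's own optimizer internally.
if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    lr_scheduler = MegatronLMDummyScheduler(
        optimizer=optimizer,
        total_num_steps=args.max_train_steps,
        warmup_num_steps=args.num_warmup_steps,
    )
else:
    lr_scheduler = get_scheduler(
        name=args.lr_scheduler_type,
        optimizer=optimizer,
        num_warmup_steps=args.num_warmup_steps,
        num_training_steps=args.max_train_steps,
    )
```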
Getting the details of the total batch size now needs to be cognizant of the tensor and pipeline parallel sizes. An example of getting the effective total batch size is shown below:
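A sketch of the arithmetic, assuming tp_degree, pp_degree, micro_batch_size and num_micro_batches mirror what was set for the Megatron-LM plugin (the names here are illustrative variables, not library attributes):

```python
# The data parallel size is whatever is left of the world size once
# tensor and pipeline parallel ranks are accounted for.
dp_size = accelerator.num_processes // (tp_degree * pp_degree)

# Effective (global) batch size per optimizer step: each data parallel rank
# processes `num_micro_batches` micro batches of `micro_batch_size` samples.
total_batch_size = micro_batch_size * num_micro_batches * dp_size
```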
When using Megatron-LM, the losses are already averaged across the data parallel group:
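A sketch of the corresponding tweak in the evaluation loop, assuming the usual losses list and eval batch variables from the example script; when the Megatron-LM integration is active the loss is appended as-is instead of being gathered:

```python
import torch

from accelerate.utils import DistributedType

with torch.no_grad():
    outputs = model(**batch)
loss = outputs.loss

if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    # Already averaged across the data parallel group by Megatron-LM.
    losses.append(loss)
else:
    losses.append(accelerator.gather_for_metrics(loss.repeat(args.per_device_eval_batch_size)))
```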
For Megatron-LM, we need to save the model using accelerator.save_state:
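A sketch of the saving logic, assuming args.output_dir and the standard unwrap-and-save path from the example script for the non-Megatron case:

```python
from accelerate.utils import DistributedType

if accelerator.distributed_type == DistributedType.MEGATRON_LM:
    # The Megatron-LM model/optimizer state is sharded across TP/PP/DP ranks,
    # so the whole training state is checkpointed via the accelerator.
    accelerator.save_state(args.output_dir)
else:
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(
        args.output_dir,
        is_main_process=accelerator.is_main_process,
        save_function=accelerator.save,
    )
```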
That's it! We are good to go 🤗. Please find the example script in the examples folder at the path accelerate/examples/by_feature/megatron_lm_gpt_pretraining.py. Let's run it for the gpt-large model architecture using 4 A100-80GB GPUs.
Below are some important excerpts from the output logs:
There are a large number of other options/features that one can set using accelerate.utils.MegatronLMPlugin.
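As an illustration, the plugin can also be constructed directly in code and handed to the Accelerator instead of being driven purely by accelerate config; the argument names below are a hedged sketch and should be checked against the MegatronLMPlugin signature of your installed version:

```python
from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin

# Illustrative values only; see the MegatronLMPlugin docstring for the full
# list of supported options (schedules, activation checkpointing, etc.).
megatron_lm_plugin = MegatronLMPlugin(
    tp_degree=2,
    pp_degree=2,
    num_micro_batches=2,
    gradient_clipping=1.0,
    sequence_parallelism=True,
    use_distributed_optimizer=True,
)

accelerator = Accelerator(megatron_lm_plugin=megatron_lm_plugin)
```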
Advanced features for writing a custom train step and using Megatron-LM Indexed Datasets
To leverage more features, please go through the details below.
Below is an example of the changes required to customize the train step while using Megatron-LM. You will implement the accelerate.utils.AbstractTrainStep or inherit from its corresponding children accelerate.utils.GPTTrainStep, accelerate.utils.BertTrainStep or accelerate.utils.T5TrainStep.
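A minimal sketch of the pattern for a GPT model; the constructor and the get_loss_func hook mirror the base GPTTrainStep, and the custom_train_step_class / custom_train_step_kwargs plugin arguments used to register the class are assumptions to verify against your installed accelerate version:

```python
from accelerate.utils import GPTTrainStep, MegatronLMPlugin

class GPTTrainStepWithCustomLoss(GPTTrainStep):
    """Illustrative subclass that overrides how the token-level loss is reduced."""

    def __init__(self, megatron_args, **kwargs):
        super().__init__(megatron_args)
        self.kwargs = kwargs

    def get_loss_func(self):
        def loss_func(inputs, loss_mask, output_tensor):
            # `output_tensor` holds per-token losses; mask out padding and average.
            losses = output_tensor.float().view(-1)
            loss_mask = loss_mask.view(-1).float()
            loss = (losses * loss_mask).sum() / loss_mask.sum()
            return loss, {"lm loss": loss}

        return loss_func

# Register the custom train step through the plugin.
megatron_lm_plugin = MegatronLMPlugin(
    custom_train_step_class=GPTTrainStepWithCustomLoss,
    custom_train_step_kwargs={},
)
```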
For using the Megatron-LM datasets, a few more changes are required. Dataloaders for these datasets are available only on rank 0 of each tensor parallel group. As such, there are ranks where the dataloader won't be available, and this requires tweaks to the training loop. Being able to do all this shows how flexible and extensible 🤗 Accelerate is. The changes required are as follows.
a. For Megatron-LM indexed datasets, we need to use MegatronLMDummyDataLoader and pass the required dataset args to it such as data_path, seq_length, etc. See here for the list of available args.
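A sketch of the dataloader setup, assuming the dataset arguments come from the script's args (names such as data_path, splits_string, seq_length and micro_batch_size follow the usual Megatron-LM dataset arguments); the flag on the plugin signals that Megatron-LM indexed datasets are being used:

```python
from accelerate.utils import MegatronLMDummyDataLoader

megatron_dataloader_config = {
    "data_path": args.data_path,
    "splits_string": args.splits_string,
    "seq_length": args.block_size,
    "micro_batch_size": args.per_device_train_batch_size,
}
megatron_dataloader = MegatronLMDummyDataLoader(**megatron_dataloader_config)

# Tell the plugin that Megatron-LM indexed datasets are in use.
accelerator.state.megatron_lm_plugin.megatron_dataset_flag = True
```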
b. megatron_dataloader is repeated 3 times to get training, validation and test dataloaders as per the args.splits_string proportions:
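For example, passing the same dummy dataloader three times to accelerator.prepare yields the train, validation and test dataloaders split according to args.splits_string:

```python
# The three returned dataloaders correspond to the train/validation/test splits.
model, optimizer, lr_scheduler, train_dataloader, eval_dataloader, test_dataloader = accelerator.prepare(
    model, optimizer, lr_scheduler, megatron_dataloader, megatron_dataloader, megatron_dataloader
)
```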
c. Changes to the training and evaluation loops are required as the dataloader is only available on tensor parallel rank 0. So, we need to iterate only if the dataloader isn't None, else provide an empty dict. As such, we loop using a while loop and break when completed_steps is equal to args.max_train_steps. This is similar to the Megatron-LM setup wherein the user has to provide max_train_steps when using Megatron-LM indexed datasets. This displays how flexible and extensible 🤗 Accelerate is.
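A condensed sketch of the training loop under these constraints (the evaluation loop follows the same pattern); it assumes the prepared objects from above and the usual args:

```python
completed_steps = 0
while completed_steps < args.max_train_steps:
    model.train()
    # The dataloader only exists on tensor parallel rank 0; the other ranks
    # pass an empty dict and get their shard of the batch from Megatron-LM.
    batch = next(train_dataloader) if train_dataloader is not None else {}
    outputs = model(**batch)
    loss = outputs.loss

    accelerator.backward(loss)
    optimizer.step()
    lr_scheduler.step()
    optimizer.zero_grad()
    completed_steps += 1
```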
Utility for Checkpoint reshaping and interoperability
The scripts for these are present in the 🤗 Transformers library under the respective models. Currently, it is available for the GPT model: checkpoint_reshaping_and_interoperability.py. Below is an example of converting a checkpoint from Megatron-LM to a universal 🤗 Transformers sharded checkpoint.
Conversion of a checkpoint from Transformers to Megatron with tp_size=2, pp_size=2 and dp_size=2:
Megatron-LM GPT models support returning logits and the megatron_generate function for text generation
Returning logits requires setting require_logits=True in MegatronLMPlugin as shown below. The logits would be available in the last stage of the pipeline.
The megatron_generate method for Megatron-LM GPT models: This will use tensor and pipeline parallelism to complete generations for a batch of inputs when using greedy decoding with/without top_k/top_p sampling, and for individual prompt inputs when using beam search decoding. Only a subset of the features of transformers generate is supported. This will help in using large models via tensor and pipeline parallelism for generation (it already does key-value caching and uses fused kernels by default). This requires the data parallel size to be 1, and sequence parallelism and activation checkpointing to be disabled. It also requires specifying the paths to the tokenizer's vocab file and merges file. The example below shows how to configure and use the megatron_generate method for a Megatron-LM GPT model.
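A hedged sketch of the setup and the call; passing the tokenizer files through other_megatron_args and the exact megatron_generate signature are assumptions here, so please cross-check against the end-to-end example linked below (paths such as args.vocab_path are illustrative):

```python
import os

from accelerate import Accelerator
from accelerate.utils import MegatronLMPlugin
from transformers import AutoTokenizer

# Point Megatron-LM at the GPT2 tokenizer files.
other_megatron_args = {
    "vocab_file": os.path.join(args.vocab_path, "vocab.json"),
    "merge_file": os.path.join(args.vocab_path, "merges.txt"),
}
megatron_lm_plugin = MegatronLMPlugin(other_megatron_args=other_megatron_args)
accelerator = Accelerator(megatron_lm_plugin=megatron_lm_plugin)

# ... the model is created and passed through accelerator.prepare(...) as usual ...

tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path)
tokenizer.pad_token = tokenizer.eos_token
batch = tokenizer(["Are you human?", "The purpose of life is"], return_tensors="pt", padding=True)

# Greedy/top-k generation for a batch of prompts.
generated_tokens = model.megatron_generate(
    batch["input_ids"],
    batch["attention_mask"],
    max_new_tokens=64,
    top_k=20,
)
decoded = tokenizer.batch_decode(generated_tokens.cpu().numpy())
```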
An end-to-end example of using the megatron_generate method for a Megatron-LM GPT model is available at megatron_gpt2_generation.py with the config file megatron_lm_gpt_generate_config.yaml. The bash script with the accelerate launch command is available at megatron_lm_gpt_generate.sh. The output logs of the script are available at megatron_lm_gpt_generate.log.
Support for ROPE and ALiBi Positional embeddings and Multi-Query Attention
For ROPE/ALiBi attention, pass position_embedding_type with ("absolute" | "rotary" | "alibi") to MegatronLMPlugin as shown below.
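A minimal sketch, assuming the setting is passed through the plugin's other_megatron_args pass-through dict:

```python
from accelerate.utils import MegatronLMPlugin

# Use rotary (ROPE) position embeddings; "alibi" or "absolute" are set the same way.
other_megatron_args = {"position_embedding_type": "rotary"}
megatron_lm_plugin = MegatronLMPlugin(other_megatron_args=other_megatron_args)
```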
For Multi-Query Attention, pass attention_head_type with ("multihead" | "multiquery") to MegatronLMPlugin as shown below.
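Similarly, a minimal sketch for enabling multi-query attention via the same pass-through dict:

```python
from accelerate.utils import MegatronLMPlugin

# Switch from the default "multihead" attention to multi-query attention.
other_megatron_args = {"attention_head_type": "multiquery"}
megatron_lm_plugin = MegatronLMPlugin(other_megatron_args=other_megatron_args)
```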
Caveats
Supports Transformers GPT2, Megatron-BERT and T5 models. This covers Decoder only, Encoder only and Encoder-Decoder model classes.
Only the loss is returned from the model forward pass as there is quite a complex interplay of pipeline, tensor and data parallelism behind the scenes. The model(**batch_data) call returns loss(es) averaged across the data parallel ranks. This is fine for most cases wherein pre-training jobs are run using Megatron-LM features and you can easily compute the perplexity using the loss. For the GPT model, returning logits in addition to loss(es) is supported. These logits aren't gathered across data parallel ranks. Use accelerator.utils.gather_across_data_parallel_groups to gather logits across data parallel ranks. These logits along with labels can be used for computing various performance metrics.

The main process is the last rank as the losses/logits are available in the last stage of the pipeline. accelerator.is_main_process and accelerator.is_local_main_process return True for the last rank when using the Megatron-LM integration.

In the accelerator.prepare call, a Megatron-LM model corresponding to a given Transformers model is created with random weights. Please use accelerator.load_state to load the Megatron-LM checkpoint with matching TP, PP and DP partitions.

Currently, checkpoint reshaping and interoperability support is only available for GPT. Soon it will be extended to BERT and T5.

gradient_accumulation_steps needs to be 1. When using Megatron-LM, micro batches in the pipeline parallelism setting are synonymous with gradient accumulation.

When using Megatron-LM, use accelerator.save_state and accelerator.load_state for saving and loading checkpoints.

Below is the mapping from Megatron-LM model architectures to the equivalent 🤗 transformers model architectures. Only these 🤗 transformers model architectures are supported.
a. Megatron-LM BertModel : 🤗 transformers models with megatron-bert in config's model type, e.g., MegatronBERT
b. Megatron-LM GPTModel : 🤗 transformers models with gpt2 in config's model type, e.g., OpenAI GPT2
c. Megatron-LM T5Model : 🤗 transformers models with t5 in config's model type, e.g., T5 and MT5