Custom hardware for training
The hardware you use to run model training and inference can have a big effect on performance. For a deep dive into GPUs, make sure to check out Tim Dettmers' excellent blog post.
Let’s have a look at some practical advice for GPU setups.
GPU
When you train bigger models you have essentially three options:
bigger GPUs
more GPUs
more CPU memory and NVMe (offloaded to by DeepSpeed-Infinity)
Let’s start with the case where you have a single GPU.
Power and Cooling
If you bought an expensive high-end GPU, make sure you give it the correct power and sufficient cooling.
Power:
Some high-end consumer GPU cards have 2 and sometimes 3 PCI-E 8-Pin power sockets. Make sure you have as many independent 12V PCI-E 8-Pin cables plugged into the card as there are sockets. Do not use the 2 splits at one end of the same cable (also known as a pigtail cable). That is, if you have 2 sockets on the GPU, you want 2 PCI-E 8-Pin cables going from your PSU to the card, and not one that has 2 PCI-E 8-Pin connectors at the end! You won’t get the full performance out of your card otherwise.
Each PCI-E 8-Pin power cable needs to be plugged into a 12V rail on the PSU side and can supply up to 150W of power.
Some other cards may use PCI-E 12-Pin connectors, which can deliver up to 500-600W of power.
Low end cards may use 6-Pin connectors, which supply up to 75W of power.
Additionally, you want a high-end PSU that has stable voltage. Some lower-quality ones may not give the card the stable voltage it needs to function at its peak.
And of course the PSU needs to have enough unused wattage to power the card.
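To check whether a card is actually getting the power it is configured for, you can query the current draw and the power limit with nvidia-smi; the query fields below are standard nvidia-smi fields, and the exact numbers are of course machine-specific:

```
# report the current power draw and the configured power limit for each GPU
nvidia-smi --query-gpu=index,name,power.draw,power.limit --format=csv
```

If the reported draw stays well below the limit while the card is under full load, insufficient cabling or an underpowered PSU is one thing worth checking.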
Cooling:
When a GPU overheats it will start throttling down and will not deliver full performance, and it can even shut down if it gets too hot.
It’s hard to tell the exact best temperature to strive for when a GPU is heavily loaded, but probably anything under +80C is good, and lower is better - 70-75C is an excellent range to be in. Throttling is likely to start at around 84-90C. Besides throttling performance, a prolonged very high temperature is also likely to reduce the lifespan of the GPU.
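To keep an eye on this, nvidia-smi can report both the current temperature and the thresholds at which your particular card starts slowing down or shuts off (both are standard nvidia-smi invocations; the thresholds reported depend on the card):

```
# current temperature of each GPU
nvidia-smi --query-gpu=index,name,temperature.gpu --format=csv

# detailed temperature report, including the card's own slowdown and shutdown thresholds
nvidia-smi -q -d TEMPERATURE
```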
Next let’s have a look at one of the most important aspects when having multiple GPUs: connectivity.
Multi-GPU Connectivity
If you use multiple GPUs the way cards are inter-connected can have a huge impact on the total training time. If the GPUs are on the same physical node, you can run:
```
nvidia-smi topo -m
```
and it will tell you how the GPUs are inter-connected. On a machine with two GPUs connected with NVLink, you will most likely see something like:
```
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      NV2     0-23            N/A
GPU1    NV2      X      0-23            N/A
```
On a different machine without NVLink you may see:
```
        GPU0    GPU1    CPU Affinity    NUMA Affinity
GPU0     X      PHB     0-11            N/A
GPU1    PHB      X      0-11            N/A
```
The report includes this legend:
```
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
So the first report, NV2, tells us the GPUs are interconnected with 2 NVLinks, while the second report, PHB, tells us we have a typical consumer-level PCIe + Host Bridge setup.
Check what type of connectivity you have on your setup. Some of these will make the communication between cards faster (e.g. NVLink), others slower (e.g. PHB).
Depending on the type of scalability solution used, the connectivity speed could have a major or a minor impact. If the GPUs need to sync rarely, as in DDP, the impact of a slower connection will be less significant. If the GPUs need to send messages to each other often, as in ZeRO-DP, then faster connectivity becomes super important to achieve faster training.
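As a quick programmatic check of whether two GPUs can talk to each other directly (over NVLink or PCIe peer-to-peer) rather than through host memory, you can ask PyTorch; the one-liner below assumes the devices of interest are GPUs 0 and 1:

```
# True means GPU 0 can access GPU 1's memory directly (NVLink or PCIe P2P)
python -c "import torch; print(torch.cuda.can_device_access_peer(0, 1))"
```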
NVLink
NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia.
Each new generation provides a faster bandwidth, e.g. here is a quote from Nvidia Ampere GA102 GPU Architecture:
Third-Generation NVLink®
GA102 GPUs utilize NVIDIA’s third-generation NVLink interface, which includes four x4 links, with each link providing 14.0625 GB/sec bandwidth in each direction between two GPUs. Four links provide 56.25 GB/sec bandwidth in each direction, and 112.5 GB/sec total bandwidth between two GPUs. Two RTX 3090 GPUs can be connected together for SLI using NVLink. (Note that 3-Way and 4-Way SLI configurations are not supported.)
So the higher the X you get in the NVX report in the output of nvidia-smi topo -m, the better. The NVLink generation will depend on your GPU architecture.
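Besides nvidia-smi topo -m, nvidia-smi also has a dedicated nvlink subcommand that reports each link and its speed; for example (assuming you want to inspect GPU 0):

```
# show each NVLink on GPU 0 and its per-direction speed; inactive links are reported as such
nvidia-smi nvlink --status -i 0
```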
Let’s compare the execution of a gpt2 language model training run over a small sample of wikitext, with and without NVLink.
The results are:
| NVLink | Time |
| ------ | ---: |
| Y      | 101s |
| N      | 131s |
You can see that NVLink completes the training ~23% faster. In the second benchmark we use NCCL_P2P_DISABLE=1 to tell the GPUs not to use NVLink.
Here is the benchmark code and outputs:
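(The commands below are a sketch rather than a verbatim reproduction: they use the run_clm.py example script from the Transformers repository, the batch size, step count and output path are illustrative choices, and the runtimes in the comments are the rounded figures from the table above.)

```
# DDP w/ NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
# train_runtime: ~101s

# DDP w/o NVLink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 NCCL_P2P_DISABLE=1 python -m torch.distributed.launch \
--nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py --model_name_or_path gpt2 \
--dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --do_train \
--output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
# train_runtime: ~131s
```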
Hardware: 2x TITAN RTX 24GB each + NVLink with 2 NVLinks (NV2 in nvidia-smi topo -m)
Software: pytorch-1.8-to-be + cuda-11.0 / transformers==4.3.0.dev0