How to use BOINC AI Accelerate with Intel® Extension for PyTorch on CPU
Intel® Extension for PyTorch
Intel® Extension for PyTorch (IPEX) is optimized for CPUs with AVX-512 or above, and works functionally on CPUs with only AVX2. It is therefore expected to bring a performance benefit on Intel CPU generations with AVX-512 or above, while CPUs with only AVX2 (e.g., AMD CPUs or older Intel CPUs) might see better performance under IPEX, but this is not guaranteed. IPEX provides performance optimizations for CPU training with both Float32 and BFloat16. The following sections focus on BFloat16.
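To see which of these tiers a machine falls into, you can inspect the CPU feature flags. The sketch below is illustrative and not part of IPEX: it assumes a Linux host where /proc/cpuinfo exposes a `flags` line, and the helper name `simd_tier` is hypothetical.

```python
from pathlib import Path


def simd_tier(cpuinfo_text: str) -> str:
    """Classify a CPU as 'avx512', 'avx2', or 'none' from /proc/cpuinfo text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        # On x86 Linux the feature list sits on lines that start with "flags".
        if line.startswith("flags"):
            _, _, values = line.partition(":")
            flags.update(values.split())
    if "avx512f" in flags:  # AVX-512 Foundation subset
        return "avx512"
    if "avx2" in flags:
        return "avx2"
    return "none"


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")  # Linux only; an assumption of this sketch
    if cpuinfo.exists():
        print(simd_tier(cpuinfo.read_text()))
```

A result of `avx2` means IPEX should still run, but a speedup is less certain than on an `avx512` machine.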
The low-precision data type BFloat16 is natively supported on 3rd Generation Xeon® Scalable Processors (aka Cooper Lake) with the AVX512 instruction set, and will be supported on the next generation of Intel® Xeon® Scalable Processors with the Intel® Advanced Matrix Extensions (Intel® AMX) instruction set, with further boosted performance. Auto Mixed Precision for the CPU backend has been enabled since PyTorch 1.10. At the same time, support for Auto Mixed Precision with BFloat16 on CPU and BFloat16 optimization of operators has been extensively enabled in Intel® Extension for PyTorch and partially upstreamed to the PyTorch master branch. Users can get better performance and a better user experience with IPEX Auto Mixed Precision.
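To make the precision trade-off concrete: BFloat16 keeps the sign bit and the 8 exponent bits of Float32 but only 7 mantissa bits, so it is effectively the upper half of a Float32. The pure-Python sketch below emulates that conversion with round-to-nearest-even. `to_bf16` is a hypothetical helper for illustration, not an IPEX or PyTorch API, and it does not handle NaN.

```python
import struct


def to_bf16(x: float) -> float:
    """Round x to the nearest BFloat16 value and return it as a Python float."""
    # Reinterpret the Float32 encoding of x as a 32-bit unsigned integer.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that BFloat16 drops.
    bits += 0x7FFF + ((bits >> 16) & 1)
    # Keep only the upper 16 bits (sign, exponent, 7 mantissa bits).
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(to_bf16(3.14159))
```

Here 3.14159 becomes 3.140625, so only about three significant decimal digits survive, while the representable range stays the same as Float32.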
IPEX installation:
IPEX releases follow PyTorch releases. To install via pip, choose the IPEX version that matches your PyTorch version:
PyTorch version    IPEX version
2.0                2.0.0
1.13               1.13.0
1.12               1.12.300
1.11               1.11.200
1.10               1.10.100
pip install intel_extension_for_pytorch==<version_name> -f https://developer.intel.com/ipex-whl-stable-cpu
Check more approaches for IPEX installation.
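If you prefer not to look up the version by hand, the pairing listed above can be encoded in a small lookup. This is a sketch only: `ipex_pip_command` is a hypothetical helper, and the mapping covers just the releases listed in this guide.

```python
# PyTorch release -> matching IPEX release, copied from the version list above.
IPEX_FOR_TORCH = {
    "2.0": "2.0.0",
    "1.13": "1.13.0",
    "1.12": "1.12.300",
    "1.11": "1.11.200",
    "1.10": "1.10.100",
}

WHEEL_INDEX = "https://developer.intel.com/ipex-whl-stable-cpu"


def ipex_pip_command(torch_version: str) -> str:
    """Build the pip command for the IPEX release matching a PyTorch version.

    Accepts strings such as "1.13.1" or "2.0.0+cpu"; only major.minor is used.
    """
    major_minor = ".".join(torch_version.split("+")[0].split(".")[:2])
    if major_minor not in IPEX_FOR_TORCH:
        raise ValueError(f"no IPEX release listed for PyTorch {major_minor}")
    ipex_version = IPEX_FOR_TORCH[major_minor]
    return f"pip install intel_extension_for_pytorch=={ipex_version} -f {WHEEL_INDEX}"


print(ipex_pip_command("1.13.1"))
```

In a real environment you could pass `torch.__version__` instead of a hard-coded string.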
How It Works for Training Optimization on CPU
🌍 Accelerate has integrated IPEX; all you need to do is enable it through the config.
Scenario 1: Acceleration of non-distributed CPU training
Run accelerate config on your machine and enable IPEX when prompted.
This generates a config file that is used automatically to set the default options when you start a script with accelerate launch.
For instance, here is how you would run the NLP example examples/nlp_example.py (from the root of the repo) with IPEX enabled, using the default_config.yaml generated by accelerate config:
accelerate launch examples/nlp_example.py
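The generated default_config.yaml is plain YAML. As a rough sketch of the settings that matter for this scenario, the snippet below renders a minimal config from a Python dict. The key names (`distributed_type`, `use_cpu`, `mixed_precision`, `ipex_config`) are assumptions about what accelerate config writes, so compare them against your own generated file rather than copying them.

```python
def render_yaml(config: dict, indent: int = 0) -> str:
    """Render a dict of scalars and nested dicts as simple YAML text."""
    pad = " " * indent
    lines = []
    for key, value in config.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.append(render_yaml(value, indent + 2))
        elif isinstance(value, bool):
            lines.append(f"{pad}{key}: {'true' if value else 'false'}")
        else:
            lines.append(f"{pad}{key}: {value}")
    return "\n".join(lines)


# Assumed key names; verify against the file that accelerate config generates.
single_node_config = {
    "distributed_type": "'NO'",  # quoted so YAML does not read it as a boolean
    "use_cpu": True,
    "mixed_precision": "bf16",
    "ipex_config": {"ipex": True},
    "num_machines": 1,
    "num_processes": 1,
}

print(render_yaml(single_node_config))
```

The point of the sketch is the combination: CPU execution, BF16 mixed precision, and IPEX switched on, with a single process on a single machine.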
Scenario 2: Acceleration of distributed CPU training
Distributed CPU training uses Intel oneCCL for communication, combined with the Intel® MPI library, to deliver flexible, efficient, scalable cluster messaging on Intel® architecture. Refer to the installation guide for setup instructions.
Run accelerate config on your machine (node0) and enable IPEX when prompted.
For instance, here is how you would run the NLP example examples/nlp_example.py (from the root of the repo) with IPEX enabled for distributed CPU training, using the default_config.yaml generated by accelerate config.
Set the required environment variables and use Intel® MPI to launch the training.
On node0, create a configuration file that contains the IP address of each node (for example, hostfile) and pass the path of that file as an argument.
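The hostfile described above can be written by hand or generated. A minimal sketch, assuming one IP address per line with node0 first; the addresses are placeholders from the RFC 5737 documentation range and `hostfile_text` is a hypothetical helper.

```python
from typing import Sequence


def hostfile_text(node_ips: Sequence[str]) -> str:
    """Return hostfile content: one node IP address per line, node0 first."""
    if not node_ips:
        raise ValueError("at least one node IP is required")
    return "\n".join(node_ips) + "\n"


# Placeholder addresses; replace them with the real IPs of node0..node3.
nodes = ["192.0.2.10", "192.0.2.11", "192.0.2.12", "192.0.2.13"]

if __name__ == "__main__":
    with open("hostfile", "w") as f:
        f.write(hostfile_text(nodes))
```

The resulting file is what you pass to the MPI launcher so that it knows which machines take part in training.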
Now, launch the training from node0 with Intel® MPI; 16 DDP processes will be enabled across node0, node1, node2, and node3 with BF16 mixed precision.
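The launch described above can be sketched as a small command builder. Several things here are assumptions rather than statements from this guide: that the 16 processes are split evenly as 4 per node, that the Intel® MPI launcher is invoked as `mpirun` with its `-f`, `-n`, and `-ppn` options, and the environment variable names in the closing comment. `mpirun_argv` is a hypothetical helper.

```python
import shlex


def mpirun_argv(hostfile: str, nodes: int, procs_per_node: int, script: str) -> list:
    """Build an Intel MPI launch command that wraps accelerate launch."""
    total = nodes * procs_per_node
    return [
        "mpirun",
        "-f", hostfile,               # file listing one node per line
        "-n", str(total),             # total number of processes
        "-ppn", str(procs_per_node),  # processes per node
        "accelerate", "launch", script,
    ]


# 4 nodes x 4 processes per node = 16 DDP processes.
argv = mpirun_argv("hostfile", nodes=4, procs_per_node=4,
                   script="examples/nlp_example.py")
print(shlex.join(argv))

# Environment assumed to be exported beforehand (values are site-specific):
#   MASTER_ADDR=<node0 ip>, CCL_WORKER_COUNT=1, CCL_ATL_TRANSPORT=ofi
```

Building the argument list in code keeps the process count consistent with the number of nodes in the hostfile; run the printed command on node0 once the oneCCL environment is set up.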