How to quantize a model
🌍 Accelerate brings bitsandbytes quantization to your model. You can now load any PyTorch model in 8-bit or 4-bit with a few lines of code.
If you want to use 🌍 Transformers models with bitsandbytes, you should follow the 🌍 Transformers quantization documentation instead.
To learn more about how bitsandbytes quantization works, check out the blog posts on 8-bit quantization and 4-bit quantization.
You will need to install the following requirements:
Install the bitsandbytes library:
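For example:

```bash
pip install bitsandbytes
```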
Install the latest accelerate from source:
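For example (the repository URL below is an assumption based on the standard Accelerate source install):

```bash
pip install git+https://github.com/huggingface/accelerate.git
```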
Install minGPT and boincai_hub to run the examples:
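For example (the minGPT repository URL is an assumption, and boincai_hub is assumed to be installable from PyPI):

```bash
git clone https://github.com/karpathy/minGPT.git
pip install ./minGPT
pip install boincai_hub
```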
Let’s take the GPT2 model from the minGPT library.
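A minimal sketch of building an empty GPT2 model with minGPT (the configuration values are illustrative, and minGPT's GPT.get_default_config() API is assumed):

```python
from accelerate import init_empty_weights
from mingpt.model import GPT

model_config = GPT.get_default_config()
model_config.model_type = "gpt2-xl"
model_config.vocab_size = 50257
model_config.block_size = 1024

# Creating the model under init_empty_weights() avoids allocating real weight
# tensors, which saves memory before the quantized weights are loaded.
with init_empty_weights():
    empty_model = GPT(model_config)
```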
Then, we need to get the path to the weights of your model. The path can be a state_dict file (e.g. “pytorch_model.bin”) or a folder containing sharded checkpoints.
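For example (the hub repository id below is purely hypothetical, and boincai_hub is assumed to expose a snapshot_download helper like huggingface_hub does):

```python
# Either point directly at a local checkpoint...
weights_location = "path/to/pytorch_model.bin"  # or a folder of sharded checkpoints

# ...or download a checkpoint folder from the hub (hypothetical repo id)
from boincai_hub import snapshot_download

weights_location = snapshot_download(repo_id="your-username/gpt2-xl-sharded")
```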
Here’s an example for 8-bit quantization:
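A sketch using Accelerate's BnbQuantizationConfig (the threshold value is illustrative):

```python
from accelerate.utils import BnbQuantizationConfig

bnb_quantization_config = BnbQuantizationConfig(load_in_8bit=True, llm_int8_threshold=6)
```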
Here’s an example for 4-bit quantization:
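A sketch for 4-bit quantization (the optional values shown are illustrative):

```python
import torch
from accelerate.utils import BnbQuantizationConfig

bnb_quantization_config = BnbQuantizationConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # optional
    bnb_4bit_use_double_quant=True,         # optional
    bnb_4bit_quant_type="nf4",              # optional
)
```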
Note that 4-bit model serialization is currently not supported.
When offloading modules to CPU or disk, 8-bit quantization converts the selected modules to 8-bit precision.
With 4-bit quantization, the selected modules are instead kept in the torch_dtype that the user passed in BnbQuantizationConfig. We will add support for converting these offloaded modules to 4-bit once 4-bit serialization is possible.
You just need to pass a custom device_map in order to offload modules to CPU or disk. The offloaded modules will be dispatched to the GPU when needed. Here’s an example:
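A sketch of such a device_map for the minGPT GPT2 model above, reusing the empty model, weights location, and quantization config from the earlier examples (the module names are assumptions that depend on the model's actual structure):

```python
from accelerate.utils import load_and_quantize_model

# Keep the embeddings on GPU 0, offload the transformer blocks to CPU,
# and put the final layers on disk.
device_map = {
    "transformer.wte": 0,
    "transformer.wpe": 0,
    "transformer.drop": 0,
    "transformer.h": "cpu",
    "transformer.ln_f": "disk",
    "lm_head": "disk",
}

quantized_model = load_and_quantize_model(
    empty_model,
    weights_location=weights_location,
    bnb_quantization_config=bnb_quantization_config,
    device_map=device_map,
    offload_folder="offload",  # where disk-offloaded weights are stored
)
```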
Note that you don’t need to pass a device_map when loading the model for training; it will automatically be loaded onto your GPU. device_map="auto" should be used for inference only.
Putting it all together: first, we need to initialize our model. To save memory, we can initialize an empty model using the init_empty_weights() context manager.
Then, you need to set your quantization configuration with BnbQuantizationConfig.
To quantize your empty model with the selected configuration, you need to use load_and_quantize_model().
Finally, you can save your 8-bit model with Accelerate using the Accelerator.save_model() method.
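A minimal end-to-end sketch of these steps, reusing the empty model, weights location, and 8-bit configuration from the earlier examples:

```python
from accelerate import Accelerator
from accelerate.utils import load_and_quantize_model

# Load the real weights into the empty model and quantize them on the fly
quantized_model = load_and_quantize_model(
    empty_model,
    weights_location=weights_location,
    bnb_quantization_config=bnb_quantization_config,
    device_map="auto",
)

# Save the 8-bit model (remember that 4-bit serialization is not supported yet)
accelerator = Accelerator()
accelerator.save_model(quantized_model, "path/to/save_directory")
```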
You can offload some modules to CPU or disk if you don’t have enough GPU memory to store the entire model. This uses Big Model Inference under the hood; check out the Big Model Inference documentation for more details.
It is not possible to perform pure 8-bit or 4-bit training on these models. However, you can train these models by leveraging parameter-efficient fine-tuning (PEFT) methods, for example by training adapters on top of them. Please have a look at the PEFT library for more details.
Currently, you can’t add adapters on top of any quantized model. However, with official adapter support for 🌍 Transformers models, you can fine-tune quantized models. If you want to fine-tune a 🌍 Transformers model, follow the 🌍 Transformers documentation instead, and check out the demo on how to fine-tune a 4-bit 🌍 Transformers model.
Check out the Google Colab demo for running quantized models on a GPT2 model. The GPT2-1.5B model checkpoint is in FP32, which uses 6GB of memory. After quantization, it uses 1.6GB with 8-bit modules and 1.2GB with 4-bit modules.