How to quantize a model

Quantization

bitsandbytes Integration

🌍 Accelerate brings bitsandbytes quantization to your model. You can now load any PyTorch model in 8-bit or 4-bit precision with a few lines of code.

If you want to use 🌍 Transformers models with bitsandbytes, you should follow this documentation.

To learn more about how bitsandbytes quantization works, check out the blog posts on 8-bit quantization and 4-bit quantization.

Prerequisites

You will need to install the following requirements:

  • Install the bitsandbytes library

pip install bitsandbytes

  • Install the latest accelerate from source

pip install git+https://github.com/boincai/accelerate.git

  • Install minGPT and boincai_hub to run the examples

git clone https://github.com/karpathy/minGPT.git
pip install minGPT/
pip install boincai_hub
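
As a quick sanity check (not part of the original guide), the snippet below simply verifies that the packages import; it assumes the package names used in the install commands above:

import accelerate
import bitsandbytes
import boincai_hub
import mingpt  # available after `pip install minGPT/`

print("accelerate:", accelerate.__version__)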

How it works

First, we need to initialize our model. To save memory, we can initialize an empty model using the init_empty_weights() context manager.

Let's take the GPT2 model from the minGPT library.

from accelerate import init_empty_weights
from mingpt.model import GPT

model_config = GPT.get_default_config()
model_config.model_type = 'gpt2-xl'
model_config.vocab_size = 50257
model_config.block_size = 1024

with init_empty_weights():
    empty_model = GPT(model_config)
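
As a quick check (not in the original guide), the parameters of a model created under init_empty_weights() live on PyTorch's meta device, so no memory is allocated for the weights:

print(next(empty_model.parameters()).device)  # device(type='meta')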

Then, we need to get the path to the weights of the model. The path can be a state_dict file (e.g. “pytorch_model.bin”) or a folder containing sharded checkpoints.

from boincai_hub import snapshot_download
weights_location = snapshot_download(repo_id="marcsun13/gpt2-xl-linear-sharded")

Finally, you need to set your quantization configuration with BnbQuantizationConfig.

Here's an example for 8-bit quantization:

from accelerate.utils import BnbQuantizationConfig
bnb_quantization_config = BnbQuantizationConfig(load_in_8bit=True, llm_int8_threshold=6)  # 6 is the usual outlier threshold for LLM.int8()

Here’s an example for 4-bit quantization:

import torch
from accelerate.utils import BnbQuantizationConfig
bnb_quantization_config = BnbQuantizationConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4")

To quantize your empty model with the selected configuration, you need to use load_and_quantize_model().

from accelerate.utils import load_and_quantize_model
quantized_model = load_and_quantize_model(empty_model, weights_location=weights_location, bnb_quantization_config=bnb_quantization_config, device_map="auto")
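
At this point quantized_model behaves like a regular PyTorch module and can be used for inference. The short sketch below is only an illustration: it assumes a CUDA GPU, minGPT's BPETokenizer, and the GPT.generate() signature from the minGPT repository, which may differ in your version.

import torch
from mingpt.bpe import BPETokenizer  # assumed: the BPE tokenizer shipped with minGPT

tokenizer = BPETokenizer()
prompt = tokenizer("Hello, my name is").to("cuda")  # (1, T) tensor of token ids on the GPU

quantized_model.eval()
with torch.no_grad():
    # greedily generate 20 additional tokens
    out = quantized_model.generate(prompt, max_new_tokens=20, do_sample=False)

print(tokenizer.decode(out[0].cpu()))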

Saving and loading 8-bit model

You can save your 8-bit model with accelerate using save_model().

from accelerate import Accelerator
accelerator = Accelerator()
new_weights_location = "path/to/save_directory"
accelerator.save_model(quantized_model, new_weights_location)

quantized_model_from_saved = load_and_quantize_model(empty_model, weights_location=new_weights_location, bnb_quantization_config=bnb_quantization_config, device_map="auto")

Note that 4-bit model serialization is currently not supported.

Offload modules to cpu and disk

You can offload some modules to cpu/disk if you don't have enough GPU memory to store the entire model. This uses big model inference under the hood; check the documentation for more details.

For 8-bit quantization, the selected modules will be converted to 8-bit precision.

For 4-bit quantization, the selected modules will be kept in the torch_dtype that the user passed in BnbQuantizationConfig. Support for converting these offloaded modules to 4-bit will be added once 4-bit serialization is possible.

You just need to pass a custom device_map in order to offload modules to cpu/disk. The offloaded modules will be dispatched to the GPU when needed. Here's an example:

device_map = {
    "transformer.wte": 0,        # keep the token embeddings on GPU 0
    "transformer.wpe": 0,        # keep the position embeddings on GPU 0
    "transformer.drop": 0,
    "transformer.h": "cpu",      # offload the transformer blocks to CPU RAM
    "transformer.ln_f": "disk",  # offload the final layer norm and lm_head to disk
    "lm_head": "disk",
}
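
You then pass this device_map to load_and_quantize_model in place of "auto". The sketch below additionally assumes the offload_folder argument for the modules placed on "disk"; check the reference for the exact signature.

from accelerate.utils import load_and_quantize_model

quantized_model = load_and_quantize_model(
    empty_model,
    weights_location=weights_location,
    bnb_quantization_config=bnb_quantization_config,
    device_map=device_map,       # the custom map defined above
    offload_folder="offload",    # assumed argument: directory backing the "disk" modules
)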

Fine-tune a quantized model

It is not possible to perform pure 8-bit or 4-bit training on these models. However, you can train these models by leveraging parameter-efficient fine-tuning (PEFT) methods, for example by training adapters on top of them. Please have a look at the peft library for more details.

Currently, you can't add adapters on top of an arbitrary quantized model. However, with the official support of adapters for 🌍 Transformers models, you can fine-tune quantized models. If you want to finetune a 🌍 Transformers model, follow this documentation instead. Check out this demo on how to fine-tune a 4-bit 🌍 Transformers model.

Note that you don't need to pass device_map when loading the model for training: it will automatically be loaded on your GPU. device_map="auto" should be used for inference only.
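
For orientation only, fine-tuning a quantized 🌍 Transformers model with LoRA adapters typically looks like the sketch below. It uses the transformers and peft packages rather than the accelerate API from this guide, and the model name, LoRA settings, and the prepare_model_for_kbit_training helper (named prepare_model_for_int8_training in older peft releases) are assumptions to adapt to your setup.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load a 🌍 Transformers model in 8-bit; quantization is handled by transformers + bitsandbytes
model = AutoModelForCausalLM.from_pretrained("gpt2-xl", load_in_8bit=True, device_map="auto")

# Cast/freeze the right sub-modules so that only the adapters are trained
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters on top of the frozen quantized weights
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()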

Example demo - running GPT2 1.5b on a Google Colab

Check out the Google Colab demo for running quantized models on a GPT2 model. The GPT2-1.5B model checkpoint is in FP32 and uses 6GB of memory. After quantization, it uses 1.6GB with 8-bit modules and 1.2GB with 4-bit modules.
