Quantization

TGI offers GPTQ and bitsandbytes quantization for large language models.

Quantization with GPTQ

GPTQ is a post-training quantization method that makes models smaller. It quantizes each layer by finding a compressed version of its weights that yields the minimum mean squared error, as shown below 👇

Given a layer $l$ with weight matrix $W_l$ and layer input $X_l$, find the quantized weights $\hat{W}_l$:

$$\hat{W}_l^{*} = \underset{\hat{W}_l}{\operatorname{argmin}} \; \lVert W_l X - \hat{W}_l X \rVert_2^2$$
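
To make the objective concrete, here is a minimal PyTorch sketch. It is illustrative only: it measures the reconstruction error of a naive round-to-nearest 4-bit candidate, and the shapes, names, and quantizer are made up for the example; it is not the solver GPTQ or TGI actually uses.

import torch

# Per-layer reconstruction error ||W_l X - W_hat_l X||_2^2 that GPTQ minimizes
def layer_quantization_error(W, W_hat, X):
    return torch.linalg.norm(W @ X - W_hat @ X) ** 2

W = torch.randn(64, 128)                                 # layer weight matrix W_l
X = torch.randn(128, 32)                                 # layer input X_l (a calibration batch)
scale = W.abs().max() / 7                                # naive symmetric scale for a 4-bit range [-7, 7]
W_hat = torch.clamp((W / scale).round(), -7, 7) * scale  # round-to-nearest stand-in for the GPTQ solution
print(layer_quantization_error(W, W_hat, X).item())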

TGI allows you to either run an already GPTQ-quantized model (see available models here) or quantize a model of your choice using the quantization script. You can run a quantized model by simply passing --quantize like below 👇

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize gptq
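# Assumes $model is set to a GPTQ-quantized model id on the Hugging Face Hub and $volume to a local directory used to cache model weights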

Note that TGI’s GPTQ implementation doesn’t use AutoGPTQ under the hood. However, models quantized using AutoGPTQ or Optimum can still be served by TGI.

To quantize a given model using GPTQ with a calibration dataset, simply run

text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq
# Add --upload-to-model-id MYUSERNAME/falcon-40b to push the created model to the hub directly
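# The first argument is the model to quantize (a Hub id or local path); the second is the output directory for the quantized weights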

This will create a new directory with the quantized files, which you can then serve with:

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id /data/falcon-40b-gptq --quantize gptq

You can learn more about the quantization options by running text-generation-server quantize --help.

If you wish to do more with GPTQ models (e.g. train an adapter on top), you can read about the transformers GPTQ integration here. You can learn more about GPTQ from the paper.

Quantization with bitsandbytes

bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn’t require a calibration dataset or any post-processing – weights are automatically quantized on load. However, inference with bitsandbytes is slower than GPTQ or FP16 precision.
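
To illustrate what quantization on load means, here is a minimal sketch of the same bitsandbytes 8-bit scheme applied through transformers; TGI performs the equivalent conversion internally when --quantize bitsandbytes is set. It assumes transformers, accelerate, and bitsandbytes are installed and a GPU is available, and the small model id is chosen only for the example.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# The checkpoint's linear-layer weights are converted to 8-bit while loading;
# no calibration dataset or separate post-processing step is needed.
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
print(model.get_memory_footprint())  # roughly half the bytes of the same model in FP16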

8-bit quantization enables multi-billion-parameter models to fit on smaller hardware without degrading performance too much. In TGI, you can use 8-bit quantization by adding --quantize bitsandbytes like below 👇

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes

4-bit quantization is also possible with bitsandbytes. You can choose one of the following 4-bit data types: 4-bit float (fp4), or 4-bit NormalFloat (nf4). These data types were introduced in the context of parameter-efficient fine-tuning, but you can apply them for inference by automatically converting the model weights on load.
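
For comparison outside of TGI, this minimal transformers sketch loads a model with its weights converted to 4-bit NormalFloat on load; swap "nf4" for "fp4" to use the 4-bit float data type instead. The model id and compute dtype are assumptions made for the example.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Convert linear-layer weights to 4-bit NormalFloat while the checkpoint is loaded
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",              # or "fp4"
        bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation at runtime
    ),
    device_map="auto",
)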

In TGI, you can use 4-bit quantization by adding --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 like below 👇

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id $model --quantize bitsandbytes-nf4

You can get more information about 8-bit quantization by reading this blog post, and 4-bit quantization by reading this blog post.
