Quantization
TGI offers GPTQ and bits-and-bytes quantization to quantize large language models.
GPTQ is a post-training quantization method to make the model smaller. It quantizes each layer by finding a compressed version of its weights that yields a minimum mean squared error, like below 👇
Given a layer $l$ with weight matrix $W_l$ and layer input $X_l$, find a quantized weight $\hat{W}_l$:

$$\hat{W}_l^* = \underset{\hat{W}_l}{\mathrm{argmin}} \; \lVert W_l X - \hat{W}_l X \rVert_2^2$$
TGI allows you to either run an already GPTQ-quantized model (see the available models) or quantize a model of your choice using the quantization script. You can run a quantized model by simply passing --quantize gptq like below 👇
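A minimal sketch of the launch command, assuming the standard TGI Docker image; the model id, volume path, and GPU/port flags below are placeholders to adapt to your setup:

```bash
# Serve an already GPTQ-quantized model with TGI (illustrative values)
model=TheBloke/Llama-2-7B-Chat-GPTQ   # any GPTQ-quantized model id from the Hub
volume=$PWD/data                      # local directory used to cache weights

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model \
    --quantize gptq
```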
Note that TGI’s GPTQ implementation doesn’t use AutoGPTQ under the hood. However, models quantized using AutoGPTQ or Optimum can still be served by TGI.
To quantize a given model using GPTQ with a calibration dataset, simply run the command below.
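This is a sketch assuming the text-generation-server CLI shipped with TGI; the model id and output directory are placeholders:

```bash
# Quantize a model with GPTQ using a calibration dataset
text-generation-server quantize tiiuae/falcon-40b /data/falcon-40b-gptq
```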
This will create a new directory with the quantized files, which you can then serve like below 👇
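This sketch points --model-id at the output directory from the previous step; the paths and Docker flags are illustrative, as above:

```bash
# Serve the freshly quantized weights
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id /data/falcon-40b-gptq \
    --quantize gptq
```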
You can learn more about the quantization options by running text-generation-server quantize --help.
bitsandbytes is a library used to apply 8-bit and 4-bit quantization to models. Unlike GPTQ quantization, bitsandbytes doesn’t require a calibration dataset or any post-processing – weights are automatically quantized on load. However, inference with bitsandbytes is slower than GPTQ or FP16 precision.
8-bit quantization enables multi-billion-parameter models to fit on smaller hardware without degrading performance too much. In TGI, you can use 8-bit quantization by adding --quantize bitsandbytes like below 👇
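A sketch assuming the same Docker setup as above; only the quantization flag changes:

```bash
# Quantize weights to 8-bit on load with bitsandbytes
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model \
    --quantize bitsandbytes
```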
4-bit quantization is also possible with bitsandbytes. You can choose one of the following 4-bit data types: 4-bit float (fp4) or 4-bit NormalFloat (nf4). These data types were introduced in the context of parameter-efficient fine-tuning, but you can apply them for inference by automatically converting the model weights on load.

In TGI, you can use 4-bit quantization by adding --quantize bitsandbytes-nf4 or --quantize bitsandbytes-fp4 like below 👇
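A sketch with the NormalFloat variant; swap in --quantize bitsandbytes-fp4 for 4-bit float (Docker flags illustrative as before):

```bash
# Quantize weights to 4-bit NormalFloat on load with bitsandbytes
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id $model \
    --quantize bitsandbytes-nf4
```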
If you wish to do more with GPTQ models (e.g. train an adapter on top), you can read about the transformers GPTQ integration. You can learn more about GPTQ from the paper.
You can get more information about 8-bit quantization by reading this blog post, and about 4-bit quantization by reading this blog post.