Rotation-based quantization with QuaRot
QuaRot is a rotation-based quantization method that inserts rotation matrices into a model to reduce outliers, which significantly improves quantization accuracy. To explain the idea, consider the vector [1, 10]. It has an “outlier”, 10. If we rotate it by 45 degrees clockwise, we get [7.7782, 6.3640]: the values are closer together and the “outlier” is gone. In rotation-based quantization we apply this idea to tensors that are much larger than 2x1 vectors. To be precise, we insert a rotation matrix before quantization and its inverse after quantization. At the floating-point level the network is therefore unchanged, but the quantized network should have much better accuracy.
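To make this concrete, here is a minimal numerical sketch in plain PyTorch (not Quark code): it fake-quantizes the vector from the example with and without the 45-degree rotation, using a simplified symmetric per-tensor int8 quantizer, and compares the reconstruction error.

```python
import math
import torch

def fake_quant_int8_per_tensor(x: torch.Tensor) -> torch.Tensor:
    # Simplified symmetric per-tensor int8 fake-quantization: quantize, then dequantize.
    scale = x.abs().max() / 127.0
    return torch.round(x / scale).clamp(-128, 127) * scale

x = torch.tensor([1.0, 10.0])              # the vector with an "outlier"

c, s = math.cos(math.pi / 4), math.sin(math.pi / 4)
R = torch.tensor([[c, s], [-s, c]])        # 45-degree clockwise rotation (orthogonal)

x_rot = R @ x                              # ~[7.7782, 6.3640]: no more outlier

# Quantize both versions; undo the rotation afterwards for a fair comparison.
err_plain = (fake_quant_int8_per_tensor(x) - x).abs().max()
err_rot = (R.T @ fake_quant_int8_per_tensor(x_rot) - x).abs().max()
print(err_plain.item(), err_rot.item())    # the rotated version has the smaller error here
```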
The QuaRot method uses Hadamard matrices for rotations. A Hadamard matrix is a square matrix whose entries are all +1 or −1 and whose rows are mutually orthogonal; scaled by 1/√n it becomes an orthogonal matrix, so applying it is an exact rotation, and because every output mixes all inputs equally it is particularly effective at spreading outliers across dimensions.
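As an illustration only (QuaRot's actual matrices and kernels may differ), the sketch below builds a normalized Hadamard matrix of power-of-two size via the Sylvester construction and verifies that it is orthogonal:

```python
import torch

def sylvester_hadamard(n: int) -> torch.Tensor:
    # Build an n x n Hadamard matrix (n a power of two) by the Sylvester construction,
    # then normalize by 1/sqrt(n) so the result is orthogonal.
    assert n > 0 and (n & (n - 1)) == 0, "n must be a power of two"
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / n ** 0.5

H = sylvester_hadamard(4)
print(H)                                          # entries are all +/- 1/sqrt(n)
print(torch.allclose(H @ H.T, torch.eye(4)))      # True: applying H is an exact rotation
```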
QuaRot inserts four fundamental rotations into the model, which we call R1, R2, R3 and R4 (see SpinQuant: LLM quantization with learned rotations). R1 and R2 are offline rotations: they are incorporated directly into the weights of the model. R3 and R4 are online operations, so they incur a small performance overhead because new operations are added to the graph of the model; if necessary, they can be sped up using kernels for fast Hadamard transforms. R3 is only needed if we are doing KV cache quantization, and R4 is only needed if we are doing activation quantization.
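The toy sketch below illustrates the “offline” idea behind R1 and R2: folding an orthogonal matrix into a linear layer's weights leaves the floating-point output unchanged, so only the quantization behaviour differs. For brevity it uses a random orthogonal matrix from a QR decomposition rather than a Hadamard matrix, and it is not Quark's actual graph rewrite.

```python
import torch

torch.manual_seed(0)
d = 8
lin = torch.nn.Linear(d, d, bias=False)
x = torch.randn(3, d)

# A random orthogonal "rotation" Q (QuaRot would use a Hadamard matrix here).
Q, _ = torch.linalg.qr(torch.randn(d, d))

# Offline rotation: fold Q into the weights. The incoming activation is assumed to
# arrive already rotated, e.g. because the previous layer's weights absorbed Q as well.
lin_rot = torch.nn.Linear(d, d, bias=False)
with torch.no_grad():
    lin_rot.weight.copy_(lin.weight @ Q)   # W' = W Q
x_rot = x @ Q                              # x' = x Q

# Since Q is orthogonal, x' W'^T = x Q Q^T W^T = x W^T: the float network is unchanged.
print(torch.allclose(lin_rot(x_rot), lin(x), atol=1e-5))  # True
```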
Quark supports the QuaRot method for Llama models by default, and it can be run in one line with the quantize_quark.py script. For example, suppose we want to quantize Llama-3-8B, both weights and activations, to int8 per tensor, and we want to apply the QuaRot rotations before quantization. Then we navigate to the folder examples/torch/language_modeling/llm_ptq and run:
python quantize_quark.py --model_dir meta-llama/Meta-Llama-3-8B --quant_scheme w_int8_a_int8_per_tensor_sym --pre_quantization_optimization quarot
Here are the perplexity results for the quantized Llama-3-8B model, with and without QuaRot:
| Quantization Strategy | Algorithm | Perplexity (Wikitext-2) |
|---|---|---|
| no quantization | | 6.13908052444458 |
| w_int8_per_tensor static quantization | N/A | 6.622321128845215 |
| w_int8_per_tensor static quantization | QuaRot (R1+R2 only) | 6.181324005126953 |
| w_int8_a_int8_per_tensor static quantization | N/A | 253.269912719726 |
| w_int8_a_int8_per_tensor static quantization | QuaRot | 6.6984167098999 |
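For context, the Wikitext-2 perplexities in this table come from a standard sliding-window evaluation. The sketch below is a simplified version of that procedure, not necessarily the exact script used to produce these numbers; it assumes the transformers, datasets and accelerate packages and access to the Llama checkpoint.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
model.eval()

# Concatenate the Wikitext-2 test split and score it in fixed-length chunks.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)

seq_len, nlls = 2048, []
with torch.no_grad():
    for start in range(0, ids.shape[1] - seq_len, seq_len):
        chunk = ids[:, start:start + seq_len]
        # With labels=input_ids the model returns the mean cross-entropy over the chunk.
        nlls.append(model(chunk, labels=chunk).loss)

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```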
Let us see an example of creating a QuaRot config file for an LLM such as Qwen, which has a standard decoder-only transformer architecture. First, let's take a look at the structure of the model:
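One way to discover the relevant module paths (assuming the transformers package and a Qwen2-style checkpoint such as Qwen/Qwen2-7B) is simply to print the submodules of one decoder layer:

```python
from transformers import AutoModelForCausalLM

# The model id is just an example; any checkpoint with the standard Qwen2 layout works.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B", torch_dtype="auto")

# Print the submodules of the first decoder layer to find the names we need to
# reference in the QuaRot config (v_proj, o_proj, the attention block, ...).
for name, module in model.model.layers[0].named_modules():
    if name:
        print(name, "->", module.__class__.__name__)
```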

As we can see, the V and O projections in the attention block can be accessed as layer.self_attn.v_proj and layer.self_attn.o_proj, respectively, for every layer in the list model.layers. However, notice that the number of input features to the down-projection (the intermediate size, 18944 for Qwen2-7B) factors as 2^9 × 37, and no Hadamard matrix of order 37 exists, so the online Hadamard transform cannot be applied there; this is why “online-had” is set to false in the following config:
{
"name": "quarot",
"online-had": false,
"backbone": "model",
"model_decoder_layers": "model.layers",
"v_proj": "self_attn.v_proj",
"o_proj":"self_attn.o_proj",
"self_attn": "self_attn"
}
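As an optional sanity check (this is not part of Quark itself), we can verify that the paths in the config resolve to real submodules of the model before running quantization; the file name quarot_qwen2.json is just an example.

```python
import json
from transformers import AutoModelForCausalLM

with open("quarot_qwen2.json") as f:   # hypothetical path to the config shown above
    cfg = json.load(f)

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-7B", torch_dtype="auto")

# nn.Module.get_submodule resolves dotted paths and raises a clear error if one is wrong.
layers = model.get_submodule(cfg["model_decoder_layers"])
first_layer = layers[0]
print(first_layer.get_submodule(cfg["self_attn"]))
print(first_layer.get_submodule(cfg["v_proj"]))
print(first_layer.get_submodule(cfg["o_proj"]))
```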
Here are the perplexity results for the quantized Qwen2-7B model, with and without QuaRot:
| Quantization Strategy | Algorithm | Perplexity (Wikitext-2) |
|---|---|---|
| no quantization | | 7.891325950622559 |
| w_int8_per_tensor static quantization | N/A | 8.883856773376465 |
| w_int8_per_tensor static quantization | QuaRot (R1+R2 only) | 7.948962688446045 |
| w_int8_a_int8_per_tensor static quantization | N/A | 172.43882751464844 |
| w_int8_a_int8_per_tensor static quantization | QuaRot (R1+R2 only) | 123.24969482421875 |
To further improve W8A8 quantization, we might combine QuaRot with SmoothQuant.