QuaRot


QuaRot smooths outliers in activations ahead of MatMul/Gemm operations. The main idea is to insert pairs of Hadamard transforms around the activations, projecting them into the Hadamard domain. Because each pair consists of a transform and its inverse, the result of the MatMul/Gemm is mathematically unchanged. Such a projection spreads concentrated energy out and, conversely, concentrates dispersed energy. Activation values are typically widely dispersed because of a few large outliers, so after the Hadamard transform their distribution becomes more concentrated, which mitigates the outliers and reduces activation quantization error. Experiments show that QuaRot can improve the post-training quantization (PTQ) accuracy of LLMs such as Llama-2, especially for models with many outliers in the activations.
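The effect described above can be illustrated with a small NumPy sketch. This is not the AMD Quark implementation: the helper names (`hadamard`, `fake_quantize`), the tensor sizes, and the simple symmetric per-tensor quantizer are all assumptions chosen for illustration. It rotates an activation with one outlier channel, applies the inverse rotation to the weight, and compares value range and 8-bit quantization error before and after.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix (Sylvester construction); n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def fake_quantize(t: np.ndarray, bits: int = 8) -> np.ndarray:
    """Illustrative symmetric per-tensor quantize/dequantize (an assumption, not Quark's quantizer)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max() / qmax
    return np.clip(np.round(t / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
dim = 64
x = rng.normal(size=(8, dim))   # toy activation
x[:, 3] *= 50.0                 # inject one outlier channel
w = rng.normal(size=(dim, dim)) # toy weight

h = hadamard(dim)
x_rot = x @ h                   # activation in the Hadamard domain
w_rot = h.T @ w                 # inverse rotation folded into the weight

# The rotation pair cancels: (x H)(H^T w) == x w up to rounding.
matmul_error = np.abs(x @ w - x_rot @ w_rot).max()

# The outlier's energy is spread over all channels, so the range shrinks.
max_abs_before = np.abs(x).max()
max_abs_after = np.abs(x_rot).max()

# A smaller range means a finer quantization step and a lower error.
mse_before = np.mean((fake_quantize(x) - x) ** 2)
mse_after = np.mean((fake_quantize(x_rot) - x_rot) ** 2)

print(f"max |x|: {max_abs_before:.2f} -> {max_abs_after:.2f}")
print(f"quantization MSE: {mse_before:.5f} -> {mse_after:.5f}")
```

Because the Hadamard matrix here is orthonormal, the quantization error measured in the rotated domain has the same norm after rotating back, so the two mean squared errors are directly comparable.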

Here is a simple example showing how to apply the QuaRot algorithm to A8W8 (8-bit activation, 8-bit weight) quantization.

from quark.onnx import ModelQuantizer
from quark.onnx.quantization.config import QConfig
from quark.onnx.quantization.config.spec import QLayerConfig, UInt8Spec, Int8Spec
from quark.onnx.quantization.config.algorithm import QuarotConfig

quant_config = QLayerConfig(activation=UInt8Spec(), weight=Int8Spec())

quarot_config = QuarotConfig(
    r_matrix_dim=4096,
    use_random_had=False,
    r_config_path="rotation_config.json",
)

config = QConfig(
    global_config=quant_config,
    algo_config=[quarot_config],
    OpTypesToQuantize=['MatMul', 'Gemm'],
)

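# input_model_path, quantized_model_path, and calib_data_reader must be defined by the caller:
# the path to the float ONNX model, the output path, and a calibration data reader.
quantizer = ModelQuantizer(config)
quantizer.quantize_model(input_model_path, quantized_model_path, calib_data_reader)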
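The snippet above leaves `calib_data_reader` undefined. The following is a minimal sketch of what such an object could look like, assuming the quantizer accepts a reader that follows the ONNX Runtime calibration convention of a `get_next()` method returning one input dictionary per call and `None` when the data is exhausted. The class name, the input names `input_ids` and `attention_mask`, and the shapes are hypothetical; they must match the inputs of the actual ONNX model.

```python
import numpy as np

class RandomTokenDataReader:
    """Hypothetical calibration reader exposing a get_next() method.

    Input names and shapes are assumptions; adapt them to the real model.
    """

    def __init__(self, num_samples=4, seq_len=16, vocab_size=32000, seed=0):
        rng = np.random.default_rng(seed)
        samples = []
        for _ in range(num_samples):
            samples.append({
                # Random token ids are placeholders only; real calibration
                # should use tokenized text representative of the workload.
                "input_ids": rng.integers(0, vocab_size, size=(1, seq_len), dtype=np.int64),
                "attention_mask": np.ones((1, seq_len), dtype=np.int64),
            })
        self._iterator = iter(samples)

    def get_next(self):
        # Return the next sample, or None once all samples are consumed.
        return next(self._iterator, None)

calib_data_reader = RandomTokenDataReader()
```

Random tokens keep the sketch self-contained, but they are unlikely to produce useful activation statistics; a real reader would feed tokenized samples from a calibration dataset.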

Arguments

Only a few important, commonly used arguments are listed here. Refer to the full argument list in the documentation for more details.

  • r_matrix_dim: (Int) Specifies the dimension used to construct the rotation matrix. The default is 4096.

  • use_random_had: (Boolean) If True, the rotation matrix is generated using the random Hadamard scheme. The default is False.

  • r_config_path: (String) Sets the path to the rotation config file. This argument is required when using QuaRot. The default is an empty string ("").
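To make the `use_random_had` option more concrete, here is a sketch of one common way to build a randomized Hadamard matrix: multiply a Hadamard matrix by a random diagonal matrix of ±1 signs. Whether AMD Quark constructs its rotation exactly this way is an assumption; the function names and the small dimension are illustrative only (the default `r_matrix_dim` of 4096 happens to equal the hidden size of Llama-2-7b).

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal Hadamard matrix via Kronecker products; n must be a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    base = np.array([[1.0, 1.0], [1.0, -1.0]])
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.kron(h, base)
    return h / np.sqrt(n)

def random_hadamard(n: int, seed: int = 0) -> np.ndarray:
    """Hypothetical randomized variant: diag(signs) @ H with random +/-1 signs."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs[:, None] * hadamard(n)

r_matrix_dim = 8  # small value for illustration
r_plain = hadamard(r_matrix_dim)
r_random = random_hadamard(r_matrix_dim, seed=123)

# Both matrices are orthogonal, so each can serve as a rotation.
print(np.allclose(r_plain @ r_plain.T, np.eye(r_matrix_dim)))
print(np.allclose(r_random @ r_random.T, np.eye(r_matrix_dim)))
```

Flipping the sign of whole rows keeps the matrix orthogonal and keeps every entry at magnitude 1/sqrt(n), so the randomized version spreads outlier energy in the same way while giving a different rotation for each seed.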

Example

Note

For information on accessing AMD Quark ONNX examples, refer to Accessing ONNX Examples. This example and the relevant files are available at /onnx/accuracy_improvement/quarot

This example demonstrates how to quantize a Llama-2-7b-hf model with the AMD Quark ONNX quantizer and how to apply the QuaRot algorithm during quantization.