QuaRot


QuaRot is designed to smooth out outliers in the input_tensors ahead of MatMul/Gemm operations. The main idea is to insert pairs of Hadamard transformations around the input_tensors, projecting them into the Hadamard domain. Such a projection spreads energy that is concentrated in a few elements across many elements, and conversely concentrates energy that is spread out. Because outliers in the input_tensors are confined to a few elements, the values after the Hadamard transform fall within a narrower range, which mitigates the outliers and reduces the quantization error of the input_tensors. Experiments show that QuaRot can improve the PTQ (post-training quantization) accuracy of LLMs such as Llama-2, especially for models with many outliers in the input_tensors.
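The effect described above can be illustrated with a small NumPy sketch. It is not AMD Quark code: the helper names (`hadamard`, `quantize_dequantize`), the toy shapes, the single outlier channel, and the symmetric per-tensor 8-bit scheme are all assumptions made for illustration. It shows that a Hadamard pair leaves the MatMul result unchanged while shrinking the dynamic range, and therefore the rounding error, of the rotated input.

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def quantize_dequantize(x: np.ndarray, bits: int = 8) -> np.ndarray:
    """Symmetric per-tensor fake quantization (illustrative scheme only)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
dim = 64
x = rng.normal(size=(16, dim))   # toy activations
x[:, 3] *= 50.0                  # hypothetical outlier channel
w = rng.normal(size=(dim, dim))  # toy weights

h = hadamard(dim)
x_rot = x @ h      # rotate the activations
w_rot = h.T @ w    # fold the inverse rotation into the weights

# The pair cancels (h @ h.T is the identity), so the MatMul output is preserved.
assert np.allclose(x @ w, x_rot @ w_rot)

# Mean absolute rounding error of the activations, before and after rotation.
err_plain = np.abs(quantize_dequantize(x) - x).mean()
err_rot = np.abs(quantize_dequantize(x_rot) - x_rot).mean()
print(f"max |x|: {np.abs(x).max():.2f}  ->  max |x_rot|: {np.abs(x_rot).max():.2f}")
print(f"mean abs quantization error: {err_plain:.4f}  ->  {err_rot:.4f}")
```

Because the rotation is orthogonal, the error measured in the rotated domain has the same norm as the error it would induce in the original domain, so the two figures are directly comparable.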

Here is a simple example showing how to apply the QuaRot algorithm to A8W8 (8-bit activation, 8-bit weight) quantization.

from quark.onnx import ModelQuantizer, QConfig, QLayerConfig, UInt8Spec, Int8Spec, QuarotConfig

# A8W8: unsigned 8-bit input tensors (activations) and signed 8-bit weights
quant_config = QLayerConfig(input_tensors=UInt8Spec(), weight=Int8Spec())

# QuaRot settings; each argument is described in the Arguments section
quarot_config = QuarotConfig(
    r_matrix_dim=4096,
    use_random_had=False,
    r_config_path="rotation_config.json",
)

config = QConfig(
    global_config=quant_config,
    algo_config=[quarot_config],
    OpTypesToQuantize=['MatMul', 'Gemm'],
)

# input_model_path, quantized_model_path, and calib_data_reader must be defined by the caller
quantizer = ModelQuantizer(config)
quantizer.quantize_model(input_model_path, quantized_model_path, calib_data_reader)

Arguments

Only the most important and commonly used arguments are listed here. Refer to the full argument list in the documentation for more details.

  • r_matrix_dim: (Int) Specifies the dimension used to construct the rotation matrix. The default value is 4096.

  • use_random_had: (Boolean) If True, the rotation matrix is generated by the random Hadamard scheme. The default is False.

  • r_config_path: (String) Sets the path to the rotation config file. This is required when using QuaRot. The default is an empty string ("").
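To give a feel for what r_matrix_dim and use_random_had control, here is a minimal NumPy sketch of a plain and a randomized Hadamard rotation matrix. It assumes that the "random Hadamard scheme" means the randomized Hadamard construction from the QuaRot paper (a Hadamard matrix multiplied by a diagonal matrix of random signs). The matrix AMD Quark actually builds may differ, and the function names and `seed` parameter below are hypothetical.

```python
from typing import Optional

import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Orthonormal n x n Hadamard matrix (Sylvester construction, n a power of two)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h / np.sqrt(n)

def random_hadamard(n: int, seed: Optional[int] = None) -> np.ndarray:
    """Hadamard matrix with each column multiplied by a random sign (H @ diag(s))."""
    rng = np.random.default_rng(seed)
    signs = rng.choice([-1.0, 1.0], size=n)
    return hadamard(n) * signs  # broadcasting scales column j by signs[j]

dim = 128  # stands in for r_matrix_dim; kept small so the demo runs quickly
r_fixed = hadamard(dim)                  # analogous to use_random_had=False
r_random = random_hadamard(dim, seed=0)  # analogous to use_random_had=True

# Both variants are orthogonal, so either can be undone by its transpose.
assert np.allclose(r_fixed @ r_fixed.T, np.eye(dim))
assert np.allclose(r_random @ r_random.T, np.eye(dim))
```

In this sketch the rotation dimension must be a power of two because of the Sylvester construction. 4096 satisfies that and equals the hidden size of Llama-2-7b.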

Example

Note

For information on accessing AMD Quark ONNX examples, refer to Accessing ONNX Examples. This example and the relevant files are available at /onnx/accuracy_improvement/quarot

This example demonstrates quantizing a Llama-2-7b-hf model using the AMD Quark ONNX quantizer. It also shows how to use the QuaRot algorithm.