QuaRot
QuaRot smooths out the outliers in the input_tensors ahead of MatMul/Gemm operations. The main idea is to insert pairs of Hadamard transformations around the input_tensors, projecting them into the Hadamard domain. This projection redistributes energy: it can concentrate energy that is spread out, or spread out energy that is concentrated. Because a few outliers scatter the values of input_tensors over a wide range, the distribution after the Hadamard transform is more concentrated, which mitigates the outliers and reduces the quantization error of the input_tensors. Experiments show that QuaRot can improve the post-training quantization (PTQ) accuracy of LLMs such as Llama-2, especially for models with many outliers in the input_tensors.
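To make the effect concrete, here is a minimal sketch in plain Python. It is not AMD Quark code. It assumes the transformation pair is an orthonormal Hadamard rotation applied to the activation together with the matching rotation applied to the weight, so the product stays the same while the activation peak shrinks. The helper names and the toy values are illustrative only.

```python
import math


def hadamard_transform(vec):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        # Butterfly step: combine elements that are h positions apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # keeps the transform orthonormal
    return [v * scale for v in out]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# Toy activation row with one large outlier channel (illustrative values).
activation = [0.5, -0.3, 0.8, 40.0, -0.6, 0.2, 0.1, -0.4]
weight_col = [0.1, 0.2, -0.3, 0.05, 0.4, -0.1, 0.25, -0.2]

rotated_act = hadamard_transform(activation)
rotated_w = hadamard_transform(weight_col)

# The outlier's energy is spread over all channels, so the peak drops.
peak_before = max(abs(v) for v in activation)
peak_after = max(abs(v) for v in rotated_act)
print(f"peak before: {peak_before:.3f}, after: {peak_after:.3f}")

# Rotating both operands leaves the MatMul result unchanged.
print(f"dot before: {dot(activation, weight_col):.6f}")
print(f"dot after:  {dot(rotated_act, rotated_w):.6f}")
```

For a symmetric 8-bit quantizer the step size is proportional to the peak magnitude, so a smaller peak leaves finer resolution for the remaining, non-outlier values.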
Here is a simple example showing how to apply the QuaRot algorithm with A8W8 (8-bit activation, 8-bit weight) quantization.
from quark.onnx import ModelQuantizer, QConfig, QLayerConfig, UInt8Spec, Int8Spec, QuarotConfig

# A8W8: unsigned 8-bit activations, signed 8-bit weights
quant_config = QLayerConfig(input_tensors=UInt8Spec(), weight=Int8Spec())

quarot_config = QuarotConfig(
    r_matrix_dim=4096,
    use_random_had=False,
    r_config_path="rotation_config.json")

config = QConfig(
    global_config=quant_config,
    algo_config=[quarot_config],
    OpTypesToQuantize=['MatMul', 'Gemm'],
)

# input_model_path, quantized_model_path, and calib_data_reader must be defined by the caller
quantizer = ModelQuantizer(config)
quantizer.quantize_model(input_model_path, quantized_model_path, calib_data_reader)
Arguments
Only a few important, commonly used arguments are listed here. Refer to the full argument list in the documentation for more details.
r_matrix_dim: (Int) Specifies the dimension used to construct the rotation matrix. The default is 4096.
use_random_had: (Boolean) If True, the rotation matrix is generated using the random Hadamard scheme. The default is False.
r_config_path: (String) Sets the path to the rotation config file. This is required when using QuaRot. The default is "" (an empty string).
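As a rough illustration of what r_matrix_dim and use_random_had control, the sketch below builds a rotation matrix in plain Python. It is an assumption about the general technique, not AMD Quark's implementation: the plain variant is a Sylvester Hadamard matrix scaled to be orthonormal, and the random variant multiplies it by a diagonal matrix of random +1/-1 signs, which is one common way to randomize a Hadamard matrix. The function names and the seed parameter are hypothetical.

```python
import math
import random


def sylvester_hadamard(n):
    """Return the n x n Sylvester Hadamard matrix (+1/-1 entries); n is a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1]]
    while len(h) < n:
        # Doubling step: [[H, H], [H, -H]]
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def rotation_matrix(dim, use_random_had=False, seed=0):
    """Hypothetical helper: build an orthonormal dim x dim rotation matrix."""
    h = sylvester_hadamard(dim)
    signs = [1] * dim
    if use_random_had:
        rng = random.Random(seed)
        signs = [rng.choice((-1, 1)) for _ in range(dim)]
    scale = 1.0 / math.sqrt(dim)
    # Scale column j by signs[j], i.e. H @ diag(signs) / sqrt(dim).
    return [[signs[j] * v * scale for j, v in enumerate(row)] for row in h]


# A small dimension keeps the output readable; the documented default of 4096 is much larger.
plain = rotation_matrix(8)
randomized = rotation_matrix(8, use_random_had=True, seed=0)
for row in randomized:
    print(" ".join(f"{v:+.3f}" for v in row))
```

Both variants are orthonormal, so multiplying by the matrix and then by its transpose recovers the original tensor. The random signs only change which sign pattern each column carries.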
Example
Note
For information on accessing AMD Quark ONNX examples, refer to Accessing ONNX Examples.
This example and the relevant files are available at /onnx/accuracy_improvement/quarot
This example demonstrates quantizing a Llama-2-7b-hf model using the AMD Quark ONNX quantizer. It also shows how to use the QuaRot algorithm.