Configuring ONNX Quantization#
Configuration of quantization in Quark for ONNX
is set by Python
dataclass
because it is rigorous and can help users avoid typos. We
provide a class Config
in quark.onnx.quantization.config.config
for configuration, as demonstrated in the example above. In Config
,
users should set certain instances (all instances are optional except
global_quant_config
):
global_quant_config
: Global quantization configuration applied to the entire model.
The Config
should be like:
from quark.onnx.quantization.config.config import Config, get_default_config
config = Config(global_quant_config=...)
We define some default global configurations, including
XINT8
and U8S8_AAWS
, which can be
used like this:
quant_config = get_default_config("U8S8_AAWS")
config = Config(global_quant_config=quant_config)
More Quantization Default Configurations#
Quark for ONNX provides users with the default configurations to quickly start model quantization.
INT8_CNN_DEFAULT
: Perform 8-bits, optimized for CNN quantization.INT16_CNN_DEFAULT
: Perform 16-bits, optimized for CNN quantization.INT8_TRANSFORMER_DEFAULT
: Perform 8-bits, optimized for transformer quantization.INT16_TRANSFORMER_DEFAULT
: Perform 16-bits, optimized for transformer quantization.INT8_CNN_ACCURATE
: Perform 8-bits, optimized for CNN quantization. Some advanced algorithms are applied to achieve higher accuracy, but will consume more time and memory space.INT16_CNN_ACCURATE
: Perform 16-bits, optimized for CNN quantization. Some advanced algorithms are applied to achieve higher accuracy, but will consume more time and memory space.INT8_TRANSFORMER_ACCURATE
: Perform 8-bits, optimized for transformer quantization. Some advanced algorithms are applied to achieve higher accuracy, but will consume more time and memory space.INT16_TRANSFORMER_ACCURATE
: Perform 16-bits, optimized for transformer quantization. Some advanced algorithms are applied to achieve higher accuracy, but will consume more time and memory space.
Quark for ONNX also provides more advanced default configurations to help users to quantize models with more options.
UINT8_DYNAMIC_QUANT
: Perform dynamic activation, uint8 weight quantization.XINT8
: Perform uint8 activation, int8 weight, optimized for NPU quantization.XINT8_ADAROUND
: Perform uint8 activation, int8 weight, optimized for NPU quantization. The adaround fast finetune will be applied to preserve quantized accuracy.XINT8_ADAQUANT
: Perform uint8 activation, int8 weight, optimized for NPU quantization. The adaquant fast finetune will be applied to preserve quantized accuracy.S8S8_AAWS
: Perform int8 asymmetric activation, int8 symmetric weight quantization.S8S8_AAWS_ADAROUND
: Perform int8 asymmetric activation, int8 symmetric weight quantization. The adaround fast finetune will be applied to preserve quantized accuracy.S8S8_AAWS_ADAQUANT
: Perform int8 asymmetric activation, int8 symmetric weight quantization. The adaquant fast finetune will be applied to preserve quantized accuracy.U8S8_AAWS
: Perform uint8 asymmetric activation, int8 symmetric weight quantization.U8S8_AAWS_ADAROUND
: Perform uint8 asymmetric activation, int8 symmetric weight quantization. The adaround fast finetune will be applied to preserve quantized accuracy.U8S8_AAWS_ADAQUANT
: Perform uint8 asymmetric activation, int8 symmetric weight quantization. The adaquant fast finetune will be applied to preserve quantized accuracy.S16S8_ASWS
: Perform int16 symmetric activation, int8 symmetric weight quantization.S16S8_ASWS_ADAROUND
: Perform int16 symmetric activation, int8 symmetric weight quantization. The adaround fast finetune will be applied to preserve quantized accuracy.S16S8_ASWS_ADAQUANT
: Perform int16 symmetric activation, int8 symmetric weight quantization. The adaquant fast finetune will be applied to preserve quantized accuracy.U16S8_AAWS
: Perform uint16 asymmetric activation, int8 symmetric weight quantization.U16S8_AAWS_ADAROUND
: Perform uint16 asymmetric activation, int8 symmetric weight quantization. The adaround fast finetune will be applied to preserve quantized accuracy.U16S8_AAWS_ADAQUANT
: Perform uint16 asymmetric activation, int8 symmetric weight quantization. The adaquant fast finetune will be applied to preserve quantized accuracy.BF16
: Perform bfloat16 activation, bfloat16 weight quantization.BFP16
: Perform BFP16 activation, BFP16 weight quantization.S16S16_MIXED_S8S8
: Perform int16 activation, int16 weight mix-precision quantization.
Customized Configurations#
Besides the default configurations in Quark for ONNX, users can also customize the quantization configuration like the example below. Please refer to Full List of Quantization Config Features for more details.
from quark.onnx import ModelQuantizer, PowerOfTwoMethod, QuantType
from quark.onnx.quantization.config.config import Config, QuantizationConfig
quant_config = QuantizationConfig(
quant_format=quark.onnx.QuantFormat.QDQ,
calibrate_method=quark.onnx.PowerOfTwoMethod.MinMSE,
input_nodes=[],
output_nodes=[],
op_types_to_quantize=[],
per_channel=False,
reduce_range=False,
activation_type=quark.onnx.QuantType.QInt8,
weight_type=quark.onnx.QuantType.QInt8,
nodes_to_quantize=[],
nodes_to_exclude=[],
optimize_model=True,
use_dynamic_quant=False,
use_external_data_format=False,
execution_providers=['CPUExecutionProvider'],
enable_npu_cnn=False,
enable_npu_transformer=False,
convert_fp16_to_fp32=False,
convert_nchw_to_nhwc=False,
include_cle=False,
include_sq=False,
extra_options={},)
config = Config(global_quant_config=quant_config)
quantizer = ModelQuantizer(config)
quantizer.quantize_model(input_model_path, output_model_path, calibration_data_reader=None)