Quark for ONNX - Configuration Description#

Configurations#

Quantization in Quark for ONNX is configured with Python dataclasses, which are rigorous and help users avoid typos. We provide a Config class in quark.onnx.quantization.config.config for this purpose, as demonstrated in the example below. In Config, users set the following fields (all fields are optional except global_quant_config):

  • global_quant_config (QuantizationConfig): Global quantization configuration applied to the entire model.

A Config is constructed like this:

from quark.onnx.quantization.config.config import Config, get_default_config
config = Config(global_quant_config=...)

We define some default global configurations, including XINT8 and U8S8_AAWS, which can be used like this:
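(The sketch below assumes get_default_config takes the configuration name as a string and returns a QuantizationConfig that can be passed as global_quant_config.)

from quark.onnx.quantization.config.config import Config, get_default_config

# Fetch a built-in global quantization configuration by name, for example XINT8.
xint8_config = get_default_config("XINT8")
config = Config(global_quant_config=xint8_config)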

More Quantization Default Configurations#

Quark for ONNX provides users with default configurations to quickly start model quantization.

  • INT8_CNN_DEFAULT: Performs 8-bit quantization, optimized for CNN models.

  • INT16_CNN_DEFAULT: Performs 16-bit quantization, optimized for CNN models.

  • INT8_TRANSFORMER_DEFAULT: Performs 8-bit quantization, optimized for transformer models.

  • INT16_TRANSFORMER_DEFAULT: Performs 16-bit quantization, optimized for transformer models.

  • INT8_CNN_ACCURATE: Performs 8-bit quantization, optimized for CNN models. Advanced algorithms are applied to achieve higher accuracy, at the cost of more time and memory.

  • INT16_CNN_ACCURATE: Performs 16-bit quantization, optimized for CNN models. Advanced algorithms are applied to achieve higher accuracy, at the cost of more time and memory.

  • INT8_TRANSFORMER_ACCURATE: Performs 8-bit quantization, optimized for transformer models. Advanced algorithms are applied to achieve higher accuracy, at the cost of more time and memory.

  • INT16_TRANSFORMER_ACCURATE: Performs 16-bit quantization, optimized for transformer models. Advanced algorithms are applied to achieve higher accuracy, at the cost of more time and memory.

Quark for ONNX also provides more advanced default configurations that give users more quantization options; a usage sketch follows the list below.

  • UINT8_DYNAMIC_QUANT: Performs dynamic activation quantization and uint8 weight quantization.

  • XINT8: Performs uint8 activation, int8 weight quantization, optimized for NPU.

  • XINT8_ADAROUND: Performs uint8 activation, int8 weight quantization, optimized for NPU. AdaRound fast finetuning is applied to preserve accuracy.

  • XINT8_ADAQUANT: Performs uint8 activation, int8 weight quantization, optimized for NPU. AdaQuant fast finetuning is applied to preserve accuracy.

  • S8S8_AAWS: Performs int8 asymmetric activation, int8 symmetric weight quantization.

  • S8S8_AAWS_ADAROUND: Performs int8 asymmetric activation, int8 symmetric weight quantization. AdaRound fast finetuning is applied to preserve accuracy.

  • S8S8_AAWS_ADAQUANT: Performs int8 asymmetric activation, int8 symmetric weight quantization. AdaQuant fast finetuning is applied to preserve accuracy.

  • U8S8_AAWS: Performs uint8 asymmetric activation, int8 symmetric weight quantization.

  • U8S8_AAWS_ADAROUND: Performs uint8 asymmetric activation, int8 symmetric weight quantization. AdaRound fast finetuning is applied to preserve accuracy.

  • U8S8_AAWS_ADAQUANT: Performs uint8 asymmetric activation, int8 symmetric weight quantization. AdaQuant fast finetuning is applied to preserve accuracy.

  • S16S8_ASWS: Performs int16 symmetric activation, int8 symmetric weight quantization.

  • S16S8_ASWS_ADAROUND: Performs int16 symmetric activation, int8 symmetric weight quantization. AdaRound fast finetuning is applied to preserve accuracy.

  • S16S8_ASWS_ADAQUANT: Performs int16 symmetric activation, int8 symmetric weight quantization. AdaQuant fast finetuning is applied to preserve accuracy.

  • U16S8_AAWS: Performs uint16 asymmetric activation, int8 symmetric weight quantization.

  • U16S8_AAWS_ADAROUND: Performs uint16 asymmetric activation, int8 symmetric weight quantization. AdaRound fast finetuning is applied to preserve accuracy.

  • U16S8_AAWS_ADAQUANT: Performs uint16 asymmetric activation, int8 symmetric weight quantization. AdaQuant fast finetuning is applied to preserve accuracy.

  • BF16: Performs bfloat16 activation, bfloat16 weight quantization.

  • BFP16: Performs BFP16 activation, BFP16 weight quantization.

  • S16S16_MIXED_S8S8: Performs int16 activation, int16 weight mixed-precision quantization.
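Putting one of these presets to work, the end-to-end sketch below reuses the get_default_config helper and the quantize_model call shown elsewhere on this page; the model paths are placeholders, and passing your own calibration data reader is recommended for real calibration.

from quark.onnx import ModelQuantizer
from quark.onnx.quantization.config.config import Config, get_default_config

# Build a Config from one of the default presets listed above (U8S8_AAWS here).
quant_config = get_default_config("U8S8_AAWS")
config = Config(global_quant_config=quant_config)

# Quantize the float model; replace the placeholder paths with your own.
quantizer = ModelQuantizer(config)
quantizer.quantize_model("model_float.onnx", "model_quantized.onnx", calibration_data_reader=None)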

Customized Configurations#

Besides the default configurations in Quark ONNX, users can also customize the quantization configuration, as in the example below. Please refer to the Full List of Quantization Config Features for more details.

from quark.onnx import ModelQuantizer, PowerOfTwoMethod, QuantFormat, QuantType
from quark.onnx.quantization.config.config import Config, QuantizationConfig

quant_config = QuantizationConfig(
    quant_format=QuantFormat.QDQ,
    calibrate_method=PowerOfTwoMethod.MinMSE,
    input_nodes=[],
    output_nodes=[],
    op_types_to_quantize=[],
    random_data_reader_input_shape=[],
    per_channel=False,
    reduce_range=False,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    nodes_to_quantize=[],
    nodes_to_exclude=[],
    optimize_model=True,
    use_dynamic_quant=False,
    use_external_data_format=False,
    execution_providers=['CPUExecutionProvider'],
    enable_npu_cnn=False,
    enable_npu_transformer=False,
    convert_fp16_to_fp32=False,
    convert_nchw_to_nhwc=False,
    include_cle=False,
    include_sq=False,
    extra_options={},
)
config = Config(global_quant_config=quant_config)

# input_model_path and output_model_path are the paths of the float ONNX model and
# the quantized output model; pass a calibration data reader for real calibration data.
quantizer = ModelQuantizer(config)
quantizer.quantize_model(input_model_path, output_model_path, calibration_data_reader=None)