Configuration Description#
Quantization in Quark for PyTorch is configured through Python dataclasses, which are rigorous and help users avoid typos. The Config class in quark.torch.quantization.config.config holds the configuration, as demonstrated in the example above. In Config, users set the following fields (all are optional except global_quant_config):
global_quant_config(QuantizationConfig)
: Global quantization configuration applied to the entire model unless overridden at the layer level.
layer_type_quant_config(Dict[str, QuantizationConfig])
: A dictionary mapping from layer types (e.g., ‘Conv2D’, ‘Dense’) to their quantization configurations. Default is an empty dictionary.
layer_quant_config(Dict[str, QuantizationConfig])
: A dictionary mapping from layer names to their quantization configurations, allowing for per-layer customization. Default is an empty dictionary.
exclude(List[str])
: A list of layer names to be excluded from quantization, enabling selective quantization of the model. Default is an empty list.
algo_config(AlgoConfig)
: Optional configuration for the quantization algorithm, such as GPTQ or AWQ. After this process, the datatype/fake_datatype of the weights is changed, together with the quantization scales.
pre_quant_opt_config(PreProcessConfig)
: Optional pre-processing optimization, such as Equalization or SmoothQuant. After this process, the values of the weights are changed, but the dtype/fake_dtype stays the same.
A Config looks like this:
from quark.torch.quantization.config.config import Config
quant_config = Config(global_quant_config=..., layer_type_quant_config=..., layer_quant_config=...)
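The way these fields interact can be pictured with a small stand-alone sketch. It does not import Quark: `SketchConfig` and `resolve` are hypothetical names, configurations are reduced to plain strings, and the precedence shown (exclude, then layer name, then layer type, then global) together with exact name matching is an assumption drawn from the field descriptions above, not a statement about Quark's internals.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for Config; each config is just a label string."""
    global_quant_config: str
    layer_type_quant_config: Dict[str, str] = field(default_factory=dict)
    layer_quant_config: Dict[str, str] = field(default_factory=dict)
    exclude: List[str] = field(default_factory=list)


def resolve(cfg: SketchConfig, layer_name: str, layer_type: str) -> Optional[str]:
    """Return the config label that applies to one layer, or None if excluded."""
    if layer_name in cfg.exclude:                  # excluded layers stay unquantized
        return None
    if layer_name in cfg.layer_quant_config:       # most specific: by layer name
        return cfg.layer_quant_config[layer_name]
    if layer_type in cfg.layer_type_quant_config:  # next: by layer type
        return cfg.layer_type_quant_config[layer_type]
    return cfg.global_quant_config                 # otherwise the global default


cfg = SketchConfig(
    global_quant_config="int8",
    layer_type_quant_config={"Conv2d": "int4"},
    layer_quant_config={"model.head": "int8-per-channel"},
    exclude=["model.embed"],
)

for name, kind in [("model.embed", "Embedding"), ("model.head", "Linear"),
                   ("model.conv1", "Conv2d"), ("model.fc", "Linear")]:
    print(name, "->", resolve(cfg, name, kind))
```

Only `global_quant_config` is required in the sketch, mirroring the note above that every other field is optional.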
Setting QuantizationConfig#
QuantizationConfig describes the global, layer-type-wise, or layer-wise quantization information for each nn.Module, such as nn.Linear. It includes:
input_tensors(QuantizationSpec)
: Quantization specification for input tensors. Defaults to None. If None, the hierarchical quantization setup applies: for example, if input_tensors in layer_type_quant_config is None, the value from global_quant_config is used instead. If it is also None in global_quant_config, input tensors are not quantized.
output_tensors(QuantizationSpec)
: Quantization specification for output tensors. Defaults to None. If None, the same fallback as above applies.
weight(QuantizationSpec)
: Quantization specification for weight tensors. Defaults to None. If None, the same fallback as above applies.
bias(QuantizationSpec)
: Quantization specification for bias tensors. Defaults to None. If None, the same fallback as above applies.
target_device(DeviceType)
: The target device (e.g., CPU, GPU, IPU) for the quantized model.
A QuantizationConfig looks like this:
from quark.torch.quantization.config.config import QuantizationConfig
QuantizationConfig(input_tensors=..., output_tensors=..., weight=..., ...)
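The None fallback described for input_tensors, output_tensors, weight, and bias can be sketched without Quark. `SketchQuantConfig` and `merge_with_global` are hypothetical names, and specifications are reduced to plain strings; the sketch only illustrates the per-field rule stated above, not how Quark implements it.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class SketchQuantConfig:
    """Hypothetical stand-in for QuantizationConfig; specs are label strings."""
    input_tensors: Optional[str] = None
    output_tensors: Optional[str] = None
    weight: Optional[str] = None
    bias: Optional[str] = None


def merge_with_global(local: SketchQuantConfig,
                      global_cfg: SketchQuantConfig) -> SketchQuantConfig:
    """Fill every field left as None in `local` with the global value."""
    merged = {}
    for f in fields(SketchQuantConfig):
        value = getattr(local, f.name)
        merged[f.name] = value if value is not None else getattr(global_cfg, f.name)
    return SketchQuantConfig(**merged)


global_cfg = SketchQuantConfig(input_tensors="int8-per-tensor", weight="int8-per-tensor")
linear_cfg = SketchQuantConfig(weight="int4-per-group")  # overrides only the weight

effective = merge_with_global(linear_cfg, global_cfg)
print(effective)
```

In this sketch the layer keeps its own weight specification, inherits input_tensors from the global configuration, and leaves output_tensors and bias unquantized because they are None at both levels.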
Configuring Quantization Strategy (Setting QuantizationSpec)#
QuantizationSpec describes the quantization specification for each tensor. Users can set these fields:
dtype
: The data type for quantization (e.g., int8, int4).
is_dynamic
: Specifies whether dynamic or static quantization should be used. Default is None, which indicates no specification.
observer_cls
: The class of observer to be used for determining quantization parameters like min/max values. Default is None.
qscheme
: The quantization scheme to use, such as per_tensor, per_channel or per_group. Default is None.
ch_axis
: The channel axis for per-channel quantization. Default is None.
group_size
: The size of the group for per-group quantization. Default is None.
symmetric
: Indicates if the quantization should be symmetric around zero. If True, quantization is symmetric. If None, it defers to a higher-level or global setting. Default is None.
round_method
: The rounding method during quantization, such as half_even. If None, it defers to a higher-level or default method. Default is None.
scale_type
: Defines the scale type to be used for quantization, like power of two or float. If None, it defers to a higher-level setting or uses a default method. Default is None.
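To make round_method and scale_type more concrete, here is a minimal pure-Python sketch of symmetric quantization with half-to-even rounding, plus one possible way to turn a float scale into a power of two. The function names are hypothetical, and rounding the scale up to the next power of two is an assumed convention; Quark's own choice is not stated here.

```python
import math


def quantize_symmetric(x: float, scale: float, bits: int = 8) -> int:
    """Quantize one value to a signed integer code with half-to-even rounding."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = round(x / scale)            # Python's built-in round() sends ties to even
    return max(qmin, min(qmax, q))  # clamp to the representable integer range


def to_power_of_two(scale: float) -> float:
    """Round a positive float scale up to the next power of two (assumed convention)."""
    return 2.0 ** math.ceil(math.log2(scale))


float_scale = 0.3
pot_scale = to_power_of_two(float_scale)

print(quantize_symmetric(2.5, 1.0))   # tie between 2 and 3
print(quantize_symmetric(1.5, 1.0))   # tie between 1 and 2
print(pot_scale)
```

A power-of-two scale lets hardware replace the multiplication by a bit shift, at the cost of a coarser fit to the data range than a float scale.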
A QuantizationSpec looks like this:
from quark.torch.quantization.config.config import QuantizationSpec
from quark.torch.quantization.config.type import Dtype, ScaleType, RoundType, QSchemeType
# The observer import path below is an assumption; adjust it to your Quark version.
from quark.torch.quantization.observer.observer import PerTensorMinMaxObserver, PerChannelMinMaxObserver

# Per Tensor Config
QuantizationSpec(dtype=Dtype.int8,
                 qscheme=QSchemeType.per_tensor,
                 observer_cls=PerTensorMinMaxObserver,
                 symmetric=True,
                 scale_type=ScaleType.float,
                 round_method=RoundType.half_even,
                 is_dynamic=False)

# Per Channel Config, should set ch_axis
QuantizationSpec(dtype=Dtype.int4,
                 observer_cls=PerChannelMinMaxObserver,
                 symmetric=True,
                 scale_type=ScaleType.float,
                 round_method=RoundType.half_even,
                 qscheme=QSchemeType.per_channel,
                 ch_axis=0,
                 is_dynamic=False)

# Per Group Config, should set ch_axis and group_size
QuantizationSpec(dtype=Dtype.int4,
                 observer_cls=PerChannelMinMaxObserver,
                 symmetric=True,
                 scale_type=ScaleType.float,
                 round_method=RoundType.half_even,
                 qscheme=QSchemeType.per_group,
                 ch_axis=0,
                 is_dynamic=False,
                 group_size=128)
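The three qscheme settings differ mainly in how many scales a tensor receives. The following stand-alone sketch computes symmetric min/max scales for a small weight matrix in pure Python. The helper names are hypothetical, rows are treated as output channels (ch_axis=0), and forming groups along each row is an assumption about layout rather than a description of Quark's behavior.

```python
from typing import List

Matrix = List[List[float]]  # each row is one output channel (ch_axis=0)


def absmax_scale(values: List[float], bits: int) -> float:
    """Symmetric scale: map the largest magnitude onto the largest positive code."""
    qmax = 2 ** (bits - 1) - 1
    return max(abs(v) for v in values) / qmax


def per_tensor_scale(w: Matrix, bits: int) -> float:
    """One scale shared by the whole tensor."""
    return absmax_scale([v for row in w for v in row], bits)


def per_channel_scales(w: Matrix, bits: int) -> List[float]:
    """One scale per row (channel)."""
    return [absmax_scale(row, bits) for row in w]


def per_group_scales(w: Matrix, bits: int, group_size: int) -> List[List[float]]:
    """One scale per block of `group_size` consecutive values within each row."""
    return [
        [absmax_scale(row[i:i + group_size], bits)
         for i in range(0, len(row), group_size)]
        for row in w
    ]


weight = [
    [1.0, -2.0, 0.5, 7.0],
    [0.25, 0.5, -14.0, 3.5],
]

print(per_tensor_scale(weight, bits=4))                # a single scale
print(per_channel_scales(weight, bits=4))              # two scales, one per row
print(per_group_scales(weight, bits=4, group_size=2))  # two scales per row
```

Finer granularity gives small-magnitude regions their own, tighter scale, which is why low-bit types such as int4 are commonly paired with per-channel or per-group schemes.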
Configuring Calibration Method#
Quark for PyTorch supports these types of calibration methods:
MinMax Calibration method (per tensor/channel/group)
Percentile Calibration method (per tensor)
MSE Calibration method (per tensor)
Users can configure the calibration method for each tensor in a module through the observer_cls field of the QuantizationSpec in the quantization configuration:
QuantizationSpec(...,
observer_cls=PerChannelMinMaxObserver,
...)
Users can choose observer_cls from:
PerTensorMinMaxObserver
PerChannelMinMaxObserver
PerTensorPercentileObserver
PerTensorMSEObserver
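The three calibration families listed above differ in how they pick the clipping threshold from observed values. The pure-Python sketch below illustrates the general idea behind each one. It is not the implementation of Quark's observer classes: the function names, the nearest-rank percentile rule, and the linear threshold search are all assumptions made for illustration.

```python
import math
from typing import List


def minmax_range(samples: List[float]) -> float:
    """MinMax: use the largest observed magnitude as the threshold."""
    return max(abs(s) for s in samples)


def percentile_range(samples: List[float], percentile: float) -> float:
    """Percentile: clip at the given percentile of magnitudes (nearest rank)."""
    ordered = sorted(abs(s) for s in samples)
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]


def quant_mse(samples: List[float], threshold: float, bits: int = 8) -> float:
    """Mean squared error after symmetric quantization with this threshold."""
    qmax = 2 ** (bits - 1) - 1
    scale = threshold / qmax
    err = 0.0
    for s in samples:
        q = max(-qmax - 1, min(qmax, round(s / scale)))
        err += (s - q * scale) ** 2
    return err / len(samples)


def mse_range(samples: List[float], bits: int = 8, steps: int = 100) -> float:
    """MSE: try shrunken thresholds and keep the one with the lowest error."""
    full = minmax_range(samples)
    candidates = [full * i / steps for i in range(1, steps)] + [full]
    return min(candidates, key=lambda t: quant_mse(samples, t, bits))


# 99 small values plus a single large outlier
samples = [i / 100.0 for i in range(-49, 50)] + [100.0]

print(minmax_range(samples))            # stretched to cover the outlier
print(percentile_range(samples, 99.0))  # ignores the top 1% of magnitudes
print(mse_range(samples))               # threshold with the lowest quantization error
```

With a single large outlier, MinMax stretches the range to cover it, the percentile rule discards it, and the MSE search keeps whichever threshold reconstructs the samples with the least error.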