Configuration Description#
In Quark for PyTorch, quantization is configured through a Python
dataclass, which is rigorous and helps users avoid typos. We
provide the class Config in quark.torch.quantization.config.config
for configuration, as demonstrated in the example above. In Config,
users can set the following fields (all fields are optional except
global_quant_config):
- global_quant_config (QuantizationConfig): Global quantization configuration applied to the entire model unless overridden at the layer level.
- layer_type_quant_config (Dict[str, QuantizationConfig]): A dictionary mapping layer types (e.g., 'Conv2D', 'Dense') to their quantization configurations. Default is an empty dictionary.
- layer_quant_config (Dict[str, QuantizationConfig]): A dictionary mapping layer names to their quantization configurations, allowing per-layer customization. Default is an empty dictionary.
- exclude (List[str]): A list of names of layers to be excluded from quantization, enabling selective quantization of the model. Default is an empty list.
- algo_config (AlgoConfig): Optional configuration for a quantization algorithm, such as GPTQ or AWQ. After this process, the datatype/fake_datatype of the weights is changed along with the quantization scales.
- pre_quant_opt_config (PreProcessConfig): Optional pre-processing optimization, such as Equalization or SmoothQuant. After this process, the values of the weights are changed, but the dtype/fake_dtype stays the same.
The Config should look like this:
from quark.torch.quantization.config.config import Config
quant_config = Config(global_quant_config=..., layer_type_quant_config=..., layer_quant_config=...)
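The precedence among these fields can be illustrated with a small stand-alone sketch. Here `resolve_config` is a hypothetical helper written only to show the lookup order (exclude, then per-layer, then per-type, then global); it is not part of the Quark API, and plain strings stand in for the Quark config objects:

```python
def resolve_config(layer_name, layer_type, global_cfg,
                   layer_type_cfg=None, layer_cfg=None, exclude=None):
    """Illustrative precedence: exclude > per-layer > per-type > global."""
    if exclude and layer_name in exclude:
        return None  # the layer is left unquantized entirely
    if layer_cfg and layer_name in layer_cfg:
        return layer_cfg[layer_name]
    if layer_type_cfg and layer_type in layer_type_cfg:
        return layer_type_cfg[layer_type]
    return global_cfg

# Example: "lm_head" is excluded, "layers.0.mlp" has a per-layer override.
cfg = resolve_config("layers.0.mlp", "Linear", "int8-global",
                     layer_cfg={"layers.0.mlp": "int4-override"},
                     exclude=["lm_head"])
print(cfg)  # int4-override
```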
Setting QuantizationConfig#
QuantizationConfig describes the global, layer-type-wise, or
layer-wise quantization information for each nn.Module, such as
nn.Linear. It includes:
- input_tensors (QuantizationSpec): Input tensors quantization specification. If None, the hierarchical quantization setup is followed: e.g., if input_tensors in layer_type_quant_config is None, the configuration from global_quant_config is used instead. Defaults to None. If None in global_quant_config, input tensors are not quantized.
- output_tensors (QuantizationSpec): Output tensors quantization specification. Defaults to None. If None, the same fallback as above applies.
- weight (QuantizationSpec): Weight tensors quantization specification. Defaults to None. If None, the same fallback as above applies.
- bias (QuantizationSpec): Bias tensors quantization specification. Defaults to None. If None, the same fallback as above applies.
- target_device (DeviceType): Configuration specifying the target device (e.g., CPU, GPU, IPU) for the quantized model.
The QuantizationConfig should look like this:
from quark.torch.quantization.config.config import QuantizationConfig
QuantizationConfig(input_tensors=..., output_tensors=..., weight=..., ...)
Configuring Quantization Strategy (Setting QuantizationSpec)#
QuantizationSpec aims to describe the quantization specification for
each tensor. Users can set these fields:

- dtype: The data type for quantization (e.g., int8, int4).
- is_dynamic: Specifies whether dynamic or static quantization should be used. Default is None, which indicates no specification.
- observer_cls: The class of observer to be used for determining quantization parameters such as min/max values. Default is None.
- qscheme: The quantization scheme to use, such as per_tensor, per_channel, or per_group. Default is None.
- ch_axis: The channel axis for per-channel quantization. Default is None.
- group_size: The size of the group for per-group quantization. Default is None.
- symmetric: Indicates whether quantization should be symmetric around zero. If True, quantization is symmetric. If None, it defers to a higher-level or global setting. Default is None.
- round_method: The rounding method used during quantization, such as half_even. If None, it defers to a higher-level or default method. Default is None.
- scale_type: The scale type to be used for quantization, such as power-of-two or float. If None, it defers to a higher-level setting or uses a default method. Default is None.
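Two of these options can be illustrated with plain Python (a simplified sketch of the concepts, not Quark's implementation): half_even rounding resolves ties to the nearest even integer, and a power-of-two scale type constrains the scale to a value of the form 2**k.

```python
import math

def round_half_even(x):
    # Python's built-in round() already uses round-half-to-even ("banker's
    # rounding"), the behavior named by round_method=half_even.
    return round(x)

def to_power_of_two(scale):
    # Snap a float scale up to the nearest power of two, as a power-of-two
    # scale_type would require (the snapping direction here is an assumption).
    return 2.0 ** math.ceil(math.log2(scale))

print(round_half_even(2.5), round_half_even(3.5))  # 2 4 (ties go to even)
print(to_power_of_two(0.3))                        # 0.5
```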
The QuantizationSpec should look like this:
from quark.torch.quantization.config.config import QuantizationSpec
from quark.torch.quantization.config.type import Dtype, ScaleType, RoundType, QSchemeType
# The observer classes (e.g. PerTensorMinMaxObserver, PerChannelMinMaxObserver)
# must also be imported; the exact module path depends on your Quark version.
# Per Tensor Config
QuantizationSpec(dtype=Dtype.int8,
qscheme=QSchemeType.per_tensor,
observer_cls=PerTensorMinMaxObserver,
symmetric=True,
scale_type=ScaleType.float,
round_method=RoundType.half_even,
is_dynamic=False)
# Per Channel Config, should set ch_axis
QuantizationSpec(dtype=Dtype.int4,
observer_cls=PerChannelMinMaxObserver,
symmetric=True,
scale_type=ScaleType.float,
round_method=RoundType.half_even,
qscheme=QSchemeType.per_channel,
ch_axis=0,
is_dynamic=False)
# Per Group Config, should set ch_axis and group_size
QuantizationSpec(dtype=Dtype.int4,
observer_cls=PerChannelMinMaxObserver,
symmetric=True,
scale_type=ScaleType.float,
round_method=RoundType.half_even,
qscheme=QSchemeType.per_group,
ch_axis=0,
is_dynamic=False,
group_size=128)
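The three specs above differ only in how many scales are produced. A stand-alone sketch of that difference, assuming symmetric quantization with scale = max|x| / qmax (this formula and the qmax value are illustrative assumptions, not Quark's actual implementation):

```python
def _absmax(vals):
    return max(abs(v) for v in vals)

def symmetric_scales(weight, qscheme, group_size=None, qmax=7.0):
    """Shape of the scale tensor under each qscheme (ch_axis=0 assumed).

    weight: 2-D list of floats, [out_channels][in_features].
    qmax=7.0 stands in for the int4 positive range; the exact range
    convention may differ in Quark.
    """
    if qscheme == "per_tensor":   # one scale for the whole tensor
        return _absmax(v for row in weight for v in row) / qmax
    if qscheme == "per_channel":  # one scale per output channel
        return [_absmax(row) / qmax for row in weight]
    if qscheme == "per_group":    # one scale per group_size slice per channel
        return [[_absmax(row[i:i + group_size]) / qmax
                 for i in range(0, len(row), group_size)]
                for row in weight]
    raise ValueError(qscheme)

w = [[0.5, -1.0, 2.0, -4.0],
     [8.0,  0.1, -0.2, 0.3]]
print(symmetric_scales(w, "per_tensor"))               # a single float
print(symmetric_scales(w, "per_channel"))              # 2 scales (one per row)
print(symmetric_scales(w, "per_group", group_size=2))  # 2 x 2 scales
```

Finer granularity (per-channel, then per-group) yields more scales and therefore a tighter fit to the weight distribution, at the cost of extra metadata.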
Configuring Calibration Method#
Quark for PyTorch supports these types of calibration methods:
- MinMax Calibration method (per tensor/channel/group)
- Percentile Calibration method (per tensor)
- MSE Calibration method (per tensor)
Users can configure the calibration method for each tensor in a module
by setting the observer_cls field in the QuantizationSpec of the
quantization configuration:
QuantizationSpec(...,
observer_cls=PerChannelMinMaxObserver,
...)
Users can choose observer_cls from:

- PerTensorMinMaxObserver
- PerChannelMinMaxObserver
- PerTensorPercentileObserver
- PerTensorMSEObserver
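The difference between the MinMax and Percentile observers can be sketched in plain Python. This is a simplified stand-in for illustration only; Quark's observers operate on tensors and the percentile value used here is an assumption:

```python
def minmax_range(samples):
    # MinMax calibration: track the smallest and largest values observed.
    return min(samples), max(samples)

def percentile_range(samples, pct=99.9):
    # Percentile calibration: clip the extreme tails so rare outliers
    # do not stretch the quantization range.
    s = sorted(samples)
    lo = s[int(len(s) * (1 - pct / 100))]
    hi = s[min(int(len(s) * (pct / 100)), len(s) - 1)]
    return lo, hi

data = list(range(1000)) + [10_000]  # one extreme outlier
print(minmax_range(data))      # (0, 10000): the outlier dominates the range
print(percentile_range(data))  # a much tighter range that ignores the outlier
```

An MSE observer goes one step further and searches for the clipping range that minimizes the quantization error rather than using a fixed percentile.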