Quantization LLM Template#
- class quark.torch.quantization.config.template.QuantizationScheme(config: QLayerConfig)[source]#
Abstract base class for quantization schemes.
- class quark.torch.quantization.config.template.Int4WeightOnlyScheme(group_size: int)[source]#
Scheme for INT4 weight-only quantization.
- class quark.torch.quantization.config.template.Int4WeightAndActivationScheme(group_size: int)[source]#
Scheme for INT4 weight and activation quantization.
- class quark.torch.quantization.config.template.Int4WeightOnlyPerChannelScheme[source]#
Scheme for INT4 weight-only per-channel quantization.
- class quark.torch.quantization.config.template.Uint4WeightOnlyScheme(group_size: int)[source]#
Scheme for UINT4 weight-only quantization.
- class quark.torch.quantization.config.template.Uint4WeightOnlyPerChannelScheme[source]#
Scheme for UINT4 weight-only per-channel quantization.
- class quark.torch.quantization.config.template.Int8Scheme[source]#
Scheme for INT8 weight and activation input quantization.
- class quark.torch.quantization.config.template.FP8Scheme[source]#
Scheme for FP8 quantization (e4m3 format).
- class quark.torch.quantization.config.template.MXFP4WeightOnlyScheme[source]#
Scheme for weight-only MXFP4 quantization (e.g. gpt-oss source format).
- class quark.torch.quantization.config.template.MXFP6E3M2Scheme[source]#
Scheme for MXFP6E3M2 quantization.
- class quark.torch.quantization.config.template.MXFP6E2M3Scheme[source]#
Scheme for MXFP6E2M3 quantization.
- class quark.torch.quantization.config.template.MXFP4_MXFP6E2M3Scheme[source]#
Scheme for MXFP4 weight and MXFP6E2M3 activation input quantization.
- class quark.torch.quantization.config.template.AmdFP4Scheme(group_size: int = 16)[source]#
Scheme for amdfp4 quantization with E5M3 scale format.
Supports only group_size=16 or group_size=32.
- class quark.torch.quantization.config.template.MXFP4_FP8Scheme[source]#
Scheme for MXFP4 weight and FP8 activation input quantization.
- class quark.torch.quantization.config.template.PTPCFP8Scheme[source]#
Scheme for PTPC FP8 quantization (Dynamic activation per-token quantization, weight quantization per-channel).
Uses FP8 Per-Channel Static for weights and FP8 Per-Token Dynamic for activations.
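The two granularities named above differ mainly in when and over which axis the scale is computed. The following plain-Python sketch only computes scales and is not Quark's implementation: the helper names are hypothetical, and the formula scale = max|x| / 448 (448 being the largest finite FP8 E4M3 value) is a common convention assumed here.

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed "fn" variant)


def per_row_scales(matrix):
    """Return one scale per row so the row's largest magnitude maps to FP8_E4M3_MAX."""
    scales = []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        # Guard against all-zero rows so the scale is never zero.
        scales.append(max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0)
    return scales


# Static, per-channel: computed once from the weight (rows = output channels).
weight = [[448.0, -224.0], [1.0, 896.0]]
weight_scales = per_row_scales(weight)


def activation_scales(activations):
    """Dynamic, per-token: recomputed for every incoming batch (rows = tokens)."""
    return per_row_scales(activations)


batch = [[224.0, 0.0], [-1792.0, 10.0]]
token_scales = activation_scales(batch)
```

The weight scales can be stored with the checkpoint, while the activation scales exist only for the lifetime of one forward pass.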
- class quark.torch.quantization.config.template.FP4Block16ScaleE4M3Scheme[source]#
Scheme for FP4 per-group quantization with FP8 E4M3 scale quantization for both weights and activations.
Uses FP4 per-group (group_size=16) with FP8 E4M3 per-tensor scale quantization. This is a two-stage quantization where the scale itself is quantized to FP8 E4M3 format. Weights use static quantization while activations use dynamic quantization.
- class quark.torch.quantization.config.template.AmdFP4GlobalScaleScheme(group_size: int)[source]#
Scheme for FP4 per-group quantization with FP8 E5M3 global scale quantization for both weights and activations.
Uses FP4 per-group with FP8 E5M3 per-tensor global scale quantization. This is a two-stage quantization where the scale itself is quantized to FP8 E5M3 format. Weights use static quantization while activations use dynamic quantization.
- class quark.torch.quantization.config.template.INT4_FP8Scheme[source]#
Scheme with INT4 weights and FP8 activations (a.k.a. “W4A8”).
The scheme name follows the <weight_format>_<activation_format> convention, the same style as mxfp4_fp8. Concretely:
- Weight (4-bit, INT4):
Quantized to INT4 (4-bit signed integer), the final stored weight format.
Quantization is progressive (two stages): the high-precision weight is first quantized to FP8 E4M3 per-tensor, then that result is re-quantized to INT4.
The INT4 stage is per-channel (ch_axis=0), symmetric, and static (no runtime calibration), using min-max observation, half-even rounding, and a float32 scale.
- Activation (8-bit, FP8):
Quantized to FP8 E4M3 (8-bit floating point, 4 exponent / 3 mantissa bits).
Per-tensor, dynamic (scale computed at runtime from each input), min-max, float32 scale.
This matches the AMD-Quark INT4-weight / FP8-activation recipe used by models such as amd/Kimi-K2-Thinking-W4A8.
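The INT4 stage described above (per-channel along axis 0, symmetric, min-max, half-even rounding) can be sketched in plain Python. This is an illustration only: it skips the preceding FP8 E4M3 stage, the function names are hypothetical, and the choice of scale = max|w| / 7 with clamping to [-8, 7] is an assumption, since Quark's exact formula is not given here.

```python
def quantize_int4_per_channel(weight):
    """Symmetric per-channel INT4 quantization of a 2-D weight (list of rows).

    Each row (one output channel, ch_axis=0) gets its own float scale taken
    from the row's min-max range. Assumed: scale = max|w| / 7, clamp to [-8, 7].
    """
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        # Python's built-in round() rounds halves to the nearest even integer.
        q_rows.append([max(-8, min(7, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Map INT4 codes back to floats using the per-channel scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weight = [[3.5, -1.75, 0.5], [14.0, 1.0, 5.0]]
q, s = quantize_int4_per_channel(weight)
restored = dequantize(q, s)
```

In the second row the values 1.0 and 5.0 land exactly halfway between two codes (0.5 and 2.5 after scaling), so half-even rounding sends them to 0 and 2 rather than 1 and 3.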
- class quark.torch.quantization.config.template.QuantizationSchemeCollection[source]#
Collection for quantization schemes.
- register_scheme(scheme_name: str, scheme: QuantizationScheme) None[source]#
Register a quantization scheme.
- get_scheme(scheme_name: str) QuantizationScheme[source]#
Get a quantization scheme by name.
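The register/get pair behaves like a name-keyed registry. The sketch below is a hypothetical stand-in written in plain Python, not the Quark class: the class name SchemeRegistry, the dictionary payload, and the choice to raise KeyError for unknown names are all assumptions made for illustration.

```python
class SchemeRegistry:
    """Hypothetical stand-in for a name-keyed collection of schemes."""

    def __init__(self):
        self._schemes = {}

    def register_scheme(self, scheme_name, scheme):
        # Assumption: a later registration under the same name replaces the earlier one.
        self._schemes[scheme_name] = scheme

    def get_scheme(self, scheme_name):
        if scheme_name not in self._schemes:
            known = ", ".join(sorted(self._schemes))
            raise KeyError(f"Unknown scheme '{scheme_name}'. Registered: {known}")
        return self._schemes[scheme_name]


registry = SchemeRegistry()
# A plain dict stands in for a real scheme object here.
registry.register_scheme("int4_wo_128", {"dtype": "int4", "group_size": 128})
scheme = registry.get_scheme("int4_wo_128")
```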
- class quark.torch.quantization.config.template.LLMTemplate(model_type: str, kv_layers_name: list[str] | None = None, q_layer_name: str | list[str] | None = None, gate_up_layers_name: list[str] | None = None, exclude_layers_name: list[str] = [], algorithm_configs: dict[str, AlgoConfig | None] | None = None, f2f_weight_converters: list[WeightConverter] | None = None, **legacy_algorithm_parameters: AlgoConfig | None)[source]#
A configuration template that defines how to quantize specific types of LLM models.
Each LLM architecture (like llama, qwen, deepseek, etc.) has its own unique structure and naming patterns for layers. This template allows specifying those patterns and quantization settings in a reusable way.
- Parameters:
model_type (str) – Type of the LLM model.
kv_layers_name (List[str]) – List of k_proj and v_proj layer name patterns to match. Default is None.
q_layer_name (Union[str, List[str]]) – q_proj layer name pattern to match. Default is None.
exclude_layers_name (List[str]) – List of layer name patterns to exclude from quantization. Default is [].
algorithm_configs (Optional[Dict[str, AlgoConfig]]) – Dictionary of algorithm names to algorithm configurations. Example: {"awq": custom_awq_config, "gptq": custom_gptq_config}. Default is None.
legacy_algorithm_parameters (Dict[str, AlgoConfig]) – Legacy keyword arguments in <algorithm>_config form (for backward compatibility). Passing these will emit a deprecation warning. Use algorithm_configs for new code.
- Note:
- The quantization schemes supported by the template are:
fp8
ptpc_fp8
int4_wo_32
int4_wo_64
int4_wo_128
int4_wo_per_channel
uint4_wo_32
uint4_wo_64
uint4_wo_128
uint4_wo_per_channel
int8
mxfp4
mxfp6_e3m2
mxfp6_e2m3
mx6
bfp16
int4_fp8
- The quantization algorithms supported by the template are:
awq
gptq
gptaq
smoothquant
autosmoothquant
qronos
rotation
- The KV cache schemes supported by the template are:
fp8
- The attention schemes supported by the template are:
fp8
Creating a Custom Template:
To create a custom template for a new model type, you can define layer name patterns and algorithm configurations specific to your model architecture. Take moonshotai/Kimi-K2-Instruct as an example:
    from quark.torch import LLMTemplate

    # Create a new template
    template = LLMTemplate(
        model_type="kimi_k2",
        kv_layers_name=["*kv_b_proj"],
        exclude_layers_name=["lm_head"],
    )

    # Register the template to LLMTemplate class (optional, if you want to use the template in other places)
    LLMTemplate.register_template(template)
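Before registering a template, it can help to check which layers a pattern such as "*kv_b_proj" would select. The sketch below assumes the patterns are shell-style globs and uses Python's fnmatch to mimic them; how Quark matches names internally is not stated here, and the layer names are illustrative rather than taken from a real checkpoint.

```python
from fnmatch import fnmatch


def matching_layers(layer_names, patterns):
    """Return the layer names that match at least one glob pattern."""
    return [name for name in layer_names if any(fnmatch(name, p) for p in patterns)]


# Illustrative layer names; a real model exposes them via model.named_modules().
layer_names = [
    "model.layers.0.self_attn.kv_b_proj",
    "model.layers.0.self_attn.q_b_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]

kv_layers = matching_layers(layer_names, ["*kv_b_proj"])
excluded = matching_layers(layer_names, ["lm_head"])
```

A pattern without a wildcard, such as "lm_head", only matches the exact name under this assumption.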
- classmethod list_available() list[str][source]#
List all available model names of registered templates.
- Returns:
List of template names.
- Return type:
List[str]
Example:
    from quark.torch import LLMTemplate

    templates = LLMTemplate.list_available()
    print(templates)  # ['llama', 'opt', 'gpt2', ...]
- classmethod register_template(template: LLMTemplate) None[source]#
Register a template.
- Parameters:
template (LLMTemplate) – The template to register.
Example:
    from quark.torch import LLMTemplate

    # Create template
    template = LLMTemplate(
        model_type="llama",
        kv_layers_name=["*k_proj", "*v_proj"],
        q_layer_name="*q_proj",
        exclude_layers_name=["lm_head"],
    )

    # Register template
    LLMTemplate.register_template(template)
- classmethod get(model_type: str) LLMTemplate[source]#
Get a template by model type.
- Parameters:
model_type (str) – Type of the model. It is obtained from the original Hugging Face model's model.config.model_type attribute. When the model_type field is not defined, model.config.architectures[0] is used as the model_type.
Available model types:
chatglm
cohere
dbrx
deepseek
deepseek_v2
deepseek_v3
deepseek_v32
deepseek_v4
deepseek_vl_v2
gemma2
gemma3
gemma3_text
glm4_moe
glm4_moe_lite
glm_moe_dsa
gptj
gpt_oss
granitemoehybrid
grok-1
instella
kimi_k2
kimi_k25
llama
llama4
minimax_m2
minimax_m3_vl
mistral
mixtral
mllama
olmo
opt
phi
phi3
qwen
qwen2
qwen2_moe
qwen3
qwen3_moe
qwen3_next
qwen3_vl_moe
qwen3_5_moe
- Returns:
The template object.
- Return type:
LLMTemplate
Example:
    from quark.torch import LLMTemplate

    template = LLMTemplate.get("llama")
    print(template)
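The model_type lookup described for get() can be sketched as a small helper. It is a hypothetical illustration, not Quark code: resolve_model_type is an invented name, the config objects are SimpleNamespace stand-ins for a Hugging Face config (which exposes the architecture list as architectures), and "MyModelForCausalLM" is a made-up architecture name.

```python
from types import SimpleNamespace


def resolve_model_type(config):
    """Prefer config.model_type; fall back to the first architecture name."""
    model_type = getattr(config, "model_type", None)
    if model_type:
        return model_type
    return config.architectures[0]


# Stand-ins for model.config; real objects come from a loaded Hugging Face model.
with_type = SimpleNamespace(model_type="llama", architectures=["LlamaForCausalLM"])
without_type = SimpleNamespace(architectures=["MyModelForCausalLM"])

resolved = [resolve_model_type(with_type), resolve_model_type(without_type)]
```

The resolved string is what would be passed to LLMTemplate.get(), so it has to match one of the registered model types.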
- classmethod register_scheme(scheme_name: str, config: QLayerConfig) None[source]#
Register a new quantization scheme for LLMTemplate class.
- Parameters:
scheme_name (str) – Name of the scheme.
config (QLayerConfig) – Configuration for the scheme.
Example:
    # Register a new quantization scheme int8_wo (int8 weight-only) to the template
    from quark.torch import LLMTemplate
    from quark.torch.quantization.config.config import Int8PerTensorSpec, QLayerConfig

    quant_spec = Int8PerTensorSpec(
        observer_method="min_max",
        symmetric=True,
        scale_type="float",
        round_method="half_even",
        is_dynamic=False,
    ).to_quantization_spec()
    global_config = QLayerConfig(weight=quant_spec)

    LLMTemplate.register_scheme("int8_wo", config=global_config)
- classmethod unregister_scheme(scheme_name: str) None[source]#
Unregister a quantization scheme.
- Parameters:
scheme_name (str) – Name of the scheme to unregister.
Example:
    from quark.torch import LLMTemplate

    LLMTemplate.unregister_scheme("int8")
- get_config(scheme: str, algorithm: str | list[str] | None = None, kv_cache_scheme: str | None = None, min_kv_scale: float = 0.0, attention_scheme: str | None = None, layer_config: dict[str, str] | None = None, layer_type_config: dict[type[Module], str] | None = None, exclude_layers: list[str] | None = None, algo_configs: dict[str, AlgoConfig] | None = None, shared_scale_groups: list[list[str]] | None = None) QConfig[source]#
Create a quantization configuration based on the provided parameters.
- Parameters:
scheme (str) – Name of the quantization scheme.
algorithm (Optional[Union[str, List[str]]]) – Name or list of names of quantization algorithms to apply.
kv_cache_scheme (Optional[str]) – Name of the KV cache quantization scheme.
min_kv_scale (float) – Minimum value of KV Cache scale.
attention_scheme (Optional[str]) – Name of the attention quantization scheme.
layer_config (Optional[Dict[str, str]]) – Dictionary of layer name patterns and quantization scheme names.
layer_type_config (Optional[Dict[Type[nn.Module], str]]) – Dictionary of layer types and quantization scheme names.
exclude_layers (Optional[List[str]]) – List of layer names to exclude from quantization.
algo_configs (Optional[Dict[str, AlgoConfig]]) – Dictionary of algorithm names to their configurations.
shared_scale_groups (Optional[List[List[str]]]) – Groups of layer name suffixes that should share the global-scale observer. Each inner list represents a group of parallel layer suffixes (e.g. ["q_proj", "k_proj", "v_proj"]). If None, the default for the scheme is used. Pass [] to disable.
Example:
    from quark.torch import LLMTemplate

    template = LLMTemplate.get("llama")
    config = template.get_config(scheme="fp8", kv_cache_scheme="fp8")
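Most get_config arguments are optional, so a thin wrapper that forwards only the ones that were set keeps call sites short. The helper below is a hypothetical convenience, not part of Quark. It relies only on the keyword names shown in the signature above, and whether a particular scheme and algorithm combination is valid for a given model is not checked here.

```python
def build_config(template, scheme, algorithm=None, kv_cache_scheme=None, layer_config=None):
    """Call template.get_config, forwarding only the options that were provided."""
    kwargs = {"scheme": scheme}
    if algorithm is not None:
        kwargs["algorithm"] = algorithm
    if kv_cache_scheme is not None:
        kwargs["kv_cache_scheme"] = kv_cache_scheme
    if layer_config is not None:
        # Maps layer name patterns to scheme names, e.g. {"*down_proj": "fp8"}.
        kwargs["layer_config"] = layer_config
    return template.get_config(**kwargs)


# With Quark installed, usage might look like this (not executed here):
#   from quark.torch import LLMTemplate
#   config = build_config(
#       LLMTemplate.get("llama"),
#       "int4_wo_128",
#       algorithm="awq",
#       kv_cache_scheme="fp8",
#   )
```

The scheme, algorithm, and KV cache names in the commented usage are taken from the supported lists in the Note above.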