Quantization LLM Template

class quark.torch.quantization.config.template.QuantizationScheme(config: QLayerConfig): Abstract base class for quantization schemes.

class quark.torch.quantization.config.template.Int4WeightOnlyScheme(group_size: int): Scheme for INT4 weight-only quantization.
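The group_size argument controls how many consecutive weights share one scale. The following is a minimal pure-Python sketch of symmetric group-wise INT4 quantization, written only to illustrate the idea. The function names, the amax / 7 scale rule, and the clamping range are assumptions for illustration and are not taken from Quark's implementation.

```python
def quantize_int4_groupwise(weights, group_size):
    """Symmetric per-group INT4 quantization of a flat list of floats."""
    if len(weights) % group_size != 0:
        raise ValueError("len(weights) must be a multiple of group_size")
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        amax = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7; use 1.0 for an all-zero group.
        scale = amax / 7.0 if amax > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)                 # Python's round() is half-to-even
            q_values.append(max(-8, min(7, q)))  # clamp to the signed 4-bit range
    return q_values, scales


def dequantize(q_values, scales, group_size):
    """Reconstruct approximate floats from INT4 codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


row = [7.0, -3.0, 0.4, 1.0, 14.0, -6.0, 2.0, 0.0]
q, s = quantize_int4_groupwise(row, group_size=4)
restored = dequantize(q, s, group_size=4)
print(q, s, restored)
```

A smaller group size gives each scale fewer values to cover, which usually lowers the rounding error at the cost of storing more scales. In this example the value 0.4 is rounded to 0 because its group scale is 1.0.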

class quark.torch.quantization.config.template.Int4WeightAndActivationScheme(group_size: int): Scheme for INT4 weight and activation quantization.

class quark.torch.quantization.config.template.Int4WeightOnlyPerChannelScheme: Scheme for INT4 weight-only per-channel quantization.

class quark.torch.quantization.config.template.Uint4WeightOnlyScheme(group_size: int): Scheme for UINT4 weight-only quantization.

class quark.torch.quantization.config.template.Uint4WeightOnlyPerChannelScheme: Scheme for UINT4 weight-only per-channel quantization.

class quark.torch.quantization.config.template.Int8Scheme: Scheme for INT8 weight and activation input quantization.

class quark.torch.quantization.config.template.FP8Scheme: Scheme for FP8 quantization (e4m3 format).

class quark.torch.quantization.config.template.MXFP4Scheme: Scheme for MXFP4 quantization.

class quark.torch.quantization.config.template.MXFP4WeightOnlyScheme: Scheme for weight-only MXFP4 quantization (e.g. gpt-oss source format).

class quark.torch.quantization.config.template.MXFP6E3M2Scheme: Scheme for MXFP6E3M2 quantization.

class quark.torch.quantization.config.template.MXFP6E2M3Scheme: Scheme for MXFP6E2M3 quantization.

class quark.torch.quantization.config.template.MXFP4_MXFP6E2M3Scheme: Scheme for MXFP4 weight and MXFP6E2M3 activation input quantization.

class quark.torch.quantization.config.template.AmdFP4Scheme(group_size: int = 16)

Scheme for amdfp4 quantization with E5M3 scale format.

Supports only group_size=16 or group_size=32.

class quark.torch.quantization.config.template.MX6Scheme: Scheme for MX6 quantization.

class quark.torch.quantization.config.template.BFP16Scheme: Scheme for BFP16 quantization.

class quark.torch.quantization.config.template.MXFP4_FP8Scheme: Scheme for MXFP4 weight and FP8 activation input quantization.

class quark.torch.quantization.config.template.PTPCFP8Scheme

Scheme for PTPC FP8 quantization (per-token dynamic activation quantization, per-channel weight quantization).

Uses FP8 Per-Channel Static for weights and FP8 Per-Token Dynamic for activations.
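To make the per-channel versus per-token distinction concrete, the sketch below computes one scale per row of a 2-D list. It is only an illustration: the helper name row_scales, the amax-based scale rule, and the constant 448.0 (the largest finite value of the OCP FP8 E4M3 format) are assumptions, and hardware-specific FP8 variants may use a different maximum.

```python
FP8_E4M3_MAX = 448.0  # assumed: largest finite value of OCP FP8 E4M3


def row_scales(rows):
    """Return one scale per row so each row's max magnitude maps to FP8_E4M3_MAX."""
    scales = []
    for row in rows:
        amax = max(abs(v) for v in row)
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


# Weights are laid out as [out_channels][in_features].
# Static: these scales are computed once, ahead of inference.
weight = [[448.0, -1.0], [2.0, -224.0]]
weight_scales = row_scales(weight)

# Activations are laid out as [tokens][features].
# Dynamic: these scales are recomputed for every incoming batch.
activations = [[896.0, 0.5], [-112.0, 56.0]]
activation_scales = row_scales(activations)

print(weight_scales, activation_scales)
```

The arithmetic is the same in both cases. What differs is when the scale is computed (offline for weights, at runtime for activations) and which axis it follows (output channel versus token).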

class quark.torch.quantization.config.template.FP4Block16ScaleE4M3Scheme

Scheme for FP4 per-group quantization with FP8 E4M3 scale quantization for both weights and activations.

Uses FP4 per-group (group_size=16) with FP8 E4M3 per-tensor scale quantization. This is a two-stage quantization where the scale itself is quantized to FP8 E4M3 format. Weights use static quantization while activations use dynamic quantization.
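The two-stage idea described above can be sketched as follows: each group of 16 values gets its own scale, and those group scales are in turn divided by a single per-tensor scale so that they fit in FP8 E4M3. This is a hedged illustration only. The helper names, the amax / 6 rule for the group scale, the choice of global scale, and the constants (6.0 as the FP4 E2M1 maximum, 448.0 as the OCP E4M3 maximum) are assumptions and may not match Quark's actual formulas.

```python
import math

FP4_E2M1_MAX = 6.0    # largest magnitude representable in FP4 (E2M1)
FP8_E4M3_MAX = 448.0  # assumed largest finite FP8 E4M3 value (OCP variant)


def round_to_e4m3(x):
    """Round a non-negative float to the nearest E4M3 value, saturating at 448."""
    if x <= 0.0:
        return 0.0
    x = min(x, FP8_E4M3_MAX)
    _, e = math.frexp(x)          # x = m * 2**e with 0.5 <= m < 1
    exponent = max(e - 1, -6)     # -6 is the smallest normal exponent
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits give 8 steps per binade
    return min(round(x / step) * step, FP8_E4M3_MAX)


def two_stage_scales(values, group_size=16):
    """Return (global_scale, per-group scales stored as E4M3 values)."""
    group_scales = []
    for start in range(0, len(values), group_size):
        amax = max(abs(v) for v in values[start:start + group_size])
        group_scales.append(amax / FP4_E2M1_MAX)
    largest = max(group_scales)
    # Pick the global scale so the largest group scale lands on the E4M3 maximum.
    global_scale = largest / FP8_E4M3_MAX if largest > 0 else 1.0
    stored = [round_to_e4m3(s / global_scale) for s in group_scales]
    return global_scale, stored


values = [6.0] + [0.0] * 15 + [3.0] + [0.0] * 15
global_scale, stored = two_stage_scales(values)
effective = [s * global_scale for s in stored]  # scales actually applied per group
print(global_scale, stored, effective)
```

The effective scale of a group is the product of its stored E4M3 value and the global scale, so only one high-precision number is kept per tensor.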

class quark.torch.quantization.config.template.AmdFP4GlobalScaleScheme(group_size: int)

Scheme for FP4 per-group quantization with FP8 E5M3 global scale quantization for both weights and activations.

Uses FP4 per-group with FP8 E5M3 per-tensor global scale quantization. This is a two-stage quantization where the scale itself is quantized to FP8 E5M3 format. Weights use static quantization while activations use dynamic quantization.

class quark.torch.quantization.config.template.INT4_FP8Scheme

Scheme with INT4 weights and FP8 activations (a.k.a. “W4A8”).

The scheme name follows the <weight_format>_<activation_format> convention, the same style as mxfp4_fp8. Concretely:

Weight (4-bit, INT4):

- Quantized to INT4 (4-bit signed integer), the final stored weight format.
- Quantization is progressive (two stages): the high-precision weight is first quantized to FP8 E4M3 per-tensor, then that result is re-quantized to INT4.
- The INT4 stage is per-channel (ch_axis=0), symmetric, and static (no runtime calibration), using min-max observation, half-even rounding, and a float32 scale.

Activation (8-bit, FP8):

- Quantized to FP8 E4M3 (8-bit floating point, 4 exponent / 3 mantissa bits).
- Per-tensor and dynamic (scale computed at runtime from each input), using min-max observation and a float32 scale.

This matches the AMD-Quark INT4-weight / FP8-activation recipe used by models such as amd/Kimi-K2-Thinking-W4A8.

class quark.torch.quantization.config.template.QuantizationSchemeCollection

Collection for quantization schemes.

register_scheme(scheme_name: str, scheme: QuantizationScheme) → None: Register a quantization scheme.

unregister_scheme(scheme_name: str) → None: Unregister a quantization scheme.

get_supported_schemes() → list[str]: Get the list of supported quantization schemes.

get_scheme(scheme_name: str) → QuantizationScheme: Get a quantization scheme by name.
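The four methods above describe a name-to-scheme registry. The sketch below mirrors that interface with a plain dictionary so the expected behaviour is easy to see. SchemeRegistry is a hypothetical stand-in, not Quark's class, and the choice to raise KeyError for unknown names is an assumption.

```python
class SchemeRegistry:
    """Hypothetical stand-in that mirrors the documented method names."""

    def __init__(self):
        self._schemes = {}

    def register_scheme(self, scheme_name, scheme):
        # Registering an existing name replaces the previous entry.
        self._schemes[scheme_name] = scheme

    def unregister_scheme(self, scheme_name):
        if scheme_name not in self._schemes:
            raise KeyError(f"unknown scheme: {scheme_name}")
        del self._schemes[scheme_name]

    def get_supported_schemes(self):
        return list(self._schemes)

    def get_scheme(self, scheme_name):
        if scheme_name not in self._schemes:
            raise KeyError(f"unknown scheme: {scheme_name}")
        return self._schemes[scheme_name]


registry = SchemeRegistry()
registry.register_scheme("fp8", {"weight": "fp8_e4m3"})        # placeholder payloads
registry.register_scheme("int4_wo_128", {"weight": "int4", "group_size": 128})
registry.unregister_scheme("fp8")
print(registry.get_supported_schemes())
```

In the real collection the stored values would be QuantizationScheme instances rather than the placeholder dictionaries used here.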

class quark.torch.quantization.config.template.LLMTemplate(model_type: str, kv_layers_name: list[str] | None = None, q_layer_name: str | list[str] | None = None, gate_up_layers_name: list[str] | None = None, exclude_layers_name: list[str] = [], algorithm_configs: dict[str, AlgoConfig | None] | None = None, f2f_weight_converters: list[WeightConverter] | None = None, **legacy_algorithm_parameters: AlgoConfig | None)

A configuration template that defines how to quantize specific types of LLM models.

Each LLM architecture (like llama, qwen, deepseek, etc.) has its own unique structure and naming patterns for layers. This template allows specifying those patterns and quantization settings in a reusable way.

Parameters:

model_type (str) – Type of the LLM model.
kv_layers_name (List[str]) – List of k_proj and v_proj layer name patterns to match. Default is None.
q_layer_name (Union[str, List[str]]) – q_proj layer name pattern to match. Default is None.
exclude_layers_name (List[str]) – List of layer name patterns to exclude from quantization. Default is [].
algorithm_configs (Optional[Dict[str, AlgoConfig]]) – Dictionary of algorithm names to algorithm configurations. Example: {"awq": custom_awq_config, "gptq": custom_gptq_config}. Default is None.
legacy_algorithm_parameters (Dict[str, AlgoConfig]) – Legacy keyword arguments in <algorithm>_config form (for backward compatibility). Passing these will emit a deprecation warning. Use algorithm_configs for new code.
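The layer-name parameters take patterns such as "*k_proj" rather than exact names. As a rough illustration of how such patterns select layers, the sketch below uses glob-style matching from Python's standard library. That Quark matches patterns with these exact semantics is an assumption, and match_layers is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def match_layers(layer_names, patterns):
    """Return the layer names that match at least one glob-style pattern."""
    return [
        name for name in layer_names
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


layers = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]

kv_layers = match_layers(layers, ["*k_proj", "*v_proj"])
excluded = match_layers(layers, ["lm_head"])
print(kv_layers, excluded)
```

A leading "*" lets one pattern cover the same projection in every decoder block, which is why a template needs only a handful of patterns per architecture.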

Note:

The quantization schemes supported by the template are:

- fp8
- ptpc_fp8
- int4_wo_32
- int4_wo_64
- int4_wo_128
- int4_wo_per_channel
- uint4_wo_32
- uint4_wo_64
- uint4_wo_128
- uint4_wo_per_channel
- int8
- mxfp4
- mxfp6_e3m2
- mxfp6_e2m3
- mx6
- bfp16
- int4_fp8

The quantization algorithms supported by the template are:

- awq
- gptq
- gptaq
- smoothquant
- autosmoothquant
- qronos
- rotation

The KV cache schemes supported by the template are:

- fp8

The attention schemes supported by the template are:

- fp8

Creating a Custom Template:

To create a custom template for a new model type, you can define layer name patterns and algorithm configurations specific to your model architecture. Take moonshotai/Kimi-K2-Instruct as an example:

from quark.torch import LLMTemplate

# Create a new template
template = LLMTemplate(
    model_type="kimi_k2",
    kv_layers_name=["*kv_b_proj"],
    exclude_layers_name=["lm_head"]
)

# Register the template to LLMTemplate class (optional, if you want to use the template in other places)
LLMTemplate.register_template(template)

classmethod list_available() → list[str]

List the model types of all registered templates.

Returns: List of template names.
Return type: List[str]

Example:

from quark.torch import LLMTemplate

templates = LLMTemplate.list_available()
print(templates)  # e.g. ['llama', 'opt', ...]

classmethod register_template(template: LLMTemplate) → None

Register a template.

Parameters: template (LLMTemplate) – The template to register.

Example:

from quark.torch import LLMTemplate

# Create template
template = LLMTemplate(
    model_type="llama",
    kv_layers_name=["*k_proj", "*v_proj"],
    q_layer_name="*q_proj",
    exclude_layers_name=["lm_head"],
)

# Register template
LLMTemplate.register_template(template)

classmethod get(model_type: str) → LLMTemplate

Get a template by model type.

Parameters: model_type (str) – Type of the model. It is obtained from the model.config.model_type attribute of the original Hugging Face LLM. When model_type is not defined, model.config.architectures[0] is used as the model_type.

Available model types:

- chatglm
- cohere
- dbrx
- deepseek
- deepseek_v2
- deepseek_v3
- deepseek_v32
- deepseek_v4
- deepseek_vl_v2
- gemma2
- gemma3
- gemma3_text
- glm4_moe
- glm4_moe_lite
- glm_moe_dsa
- gptj
- gpt_oss
- granitemoehybrid
- grok-1
- instella
- kimi_k2
- kimi_k25
- llama
- llama4
- minimax_m2
- minimax_m3_vl
- mistral
- mixtral
- mllama
- olmo
- opt
- phi
- phi3
- qwen
- qwen2
- qwen2_moe
- qwen3
- qwen3_moe
- qwen3_next
- qwen3_vl_moe
- qwen3_5_moe

Returns: The template object.
Return type: LLMTemplate

Example:

from quark.torch import LLMTemplate

template = LLMTemplate.get("llama")
print(template)
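The model_type lookup rule described for this method (use the config's model_type, otherwise fall back to the first architecture name) can be sketched without loading a model. resolve_model_type is a hypothetical helper, and the stand-in config objects below only imitate the Hugging Face attributes model_type and architectures.

```python
from types import SimpleNamespace


def resolve_model_type(config):
    """Return config.model_type, falling back to the first architecture name."""
    model_type = getattr(config, "model_type", None)
    if model_type:
        return model_type
    architectures = getattr(config, "architectures", None) or []
    if not architectures:
        raise ValueError("config defines neither model_type nor architectures")
    return architectures[0]


# Stand-ins for model.config; real code would read these from a loaded model.
with_type = SimpleNamespace(model_type="llama", architectures=["LlamaForCausalLM"])
without_type = SimpleNamespace(architectures=["Grok1ModelForCausalLM"])

print(resolve_model_type(with_type))
print(resolve_model_type(without_type))
```

The resulting string is what would be passed to LLMTemplate.get, so it has to match one of the registered template names.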

classmethod register_scheme(scheme_name: str, config: QLayerConfig) → None

Register a new quantization scheme for the LLMTemplate class.

Parameters:

scheme_name (str) – Name of the scheme.
config (QLayerConfig) – Configuration for the scheme.

Example:

# Register a new quantization scheme "int8_wo" (INT8 weight-only) with the template
from quark.torch import LLMTemplate
from quark.torch.quantization.config.config import Int8PerTensorSpec, QLayerConfig

quant_spec = Int8PerTensorSpec(observer_method="min_max", symmetric=True, scale_type="float",
                               round_method="half_even", is_dynamic=False).to_quantization_spec()
global_config = QLayerConfig(weight=quant_spec)

LLMTemplate.register_scheme("int8_wo", config=global_config)

classmethod unregister_scheme(scheme_name: str) → None

Unregister a quantization scheme.

Parameters: scheme_name (str) – Name of the scheme to unregister.

Example:

from quark.torch import LLMTemplate

LLMTemplate.unregister_scheme("int8")

classmethod get_supported_schemes() → list[str]: Get the list of supported quantization schemes.

get_config(scheme: str, algorithm: str | list[str] | None = None, kv_cache_scheme: str | None = None, min_kv_scale: float = 0.0, attention_scheme: str | None = None, layer_config: dict[str, str] | None = None, layer_type_config: dict[type[Module], str] | None = None, exclude_layers: list[str] | None = None, algo_configs: dict[str, AlgoConfig] | None = None, shared_scale_groups: list[list[str]] | None = None) → QConfig

Create a quantization configuration based on the provided parameters.

Parameters:

scheme (str) – Name of the quantization scheme.
algorithm (Optional[Union[str, List[str]]]) – Name or list of names of quantization algorithms to apply.
kv_cache_scheme (Optional[str]) – Name of the KV cache quantization scheme.
min_kv_scale (float) – Minimum value of KV Cache scale.
attention_scheme (Optional[str]) – Name of the attention quantization scheme.
layer_config (Optional[Dict[str, str]]) – Dictionary of layer name patterns and quantization scheme names.
layer_type_config (Optional[Dict[Type[nn.Module], str]]) – Dictionary of layer types and quantization scheme names.
exclude_layers (Optional[List[str]]) – List of layer names to exclude from quantization.
algo_configs (Optional[Dict[str, AlgoConfig]]) – Dictionary of algorithm names to their configurations.
shared_scale_groups (Optional[List[List[str]]]) – Groups of layer name suffixes that should share the global-scale observer. Each inner list represents a group of parallel layer suffixes (e.g. ["q_proj", "k_proj", "v_proj"]). If None, the default for the scheme is used. Pass [] to disable.

Example:

from quark.torch import LLMTemplate

template = LLMTemplate.get("llama")
config = template.get_config(scheme="fp8", kv_cache_scheme="fp8")
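The shared_scale_groups parameter is described in terms of layer name suffixes. One plausible reading, sketched below, is that layers under the same parent module whose suffixes belong to the same group are bucketed together. The helper group_layers_by_suffix and this bucketing rule are assumptions for illustration and may differ from how Quark resolves the groups.

```python
from collections import defaultdict


def group_layers_by_suffix(layer_names, shared_scale_groups):
    """Bucket sibling layers whose final name component is in the same group."""
    buckets = defaultdict(list)
    for name in layer_names:
        prefix, _, suffix = name.rpartition(".")
        for group_index, suffixes in enumerate(shared_scale_groups):
            if suffix in suffixes:
                buckets[(prefix, group_index)].append(name)
    # Only buckets with at least two layers have anything to share.
    return [names for names in buckets.values() if len(names) > 1]


layers = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.self_attn.o_proj",
    "model.layers.1.self_attn.q_proj",
    "model.layers.1.self_attn.k_proj",
    "model.layers.1.self_attn.v_proj",
]

groups = group_layers_by_suffix(layers, [["q_proj", "k_proj", "v_proj"]])
print(groups)
```

With an empty list of suffix groups the helper returns no buckets, which corresponds to the documented way of disabling sharing by passing [].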
