Quantization LLM Template

class quark.torch.quantization.config.template.QuantizationScheme(config: QLayerConfig)

Abstract base class for quantization schemes.

class quark.torch.quantization.config.template.Int4WeightOnlyScheme(group_size: int)

Scheme for INT4 weight-only quantization.
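As a rough illustration of what the group_size argument controls, the sketch below computes one scale per contiguous group of weights in a single output channel. It is plain Python, not Quark code: the helper name group_scales, the symmetric absmax scaling, and the sample weights are all assumptions made for the example.

```python
def group_scales(row, group_size, qmax=7):
    """One symmetric scale per contiguous group of `group_size` weights in a row.

    Hypothetical helper for illustration; qmax=7 is the largest positive INT4 value.
    """
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    groups = [row[i:i + group_size] for i in range(0, len(row), group_size)]
    return [max(abs(v) for v in g) / qmax for g in groups]


# 128 weights from one output channel, with one group of large values.
row = [0.5] * 32 + [7.0] * 32 + [0.25] * 64

scales_32 = group_scales(row, 32)    # four scales, one per group of 32
scales_128 = group_scales(row, 128)  # a single scale for the whole row
```

With group_size=32 only the second group pays for the large values; with group_size=128 every weight in the row shares the scale set by the largest one.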

class quark.torch.quantization.config.template.Int4WeightAndActivationScheme(group_size: int)

Scheme for INT4 weight and activation quantization.

class quark.torch.quantization.config.template.Int4WeightOnlyPerChannelScheme

Scheme for INT4 weight-only per-channel quantization.

class quark.torch.quantization.config.template.Uint4WeightOnlyScheme(group_size: int)

Scheme for UINT4 weight-only quantization.

class quark.torch.quantization.config.template.Uint4WeightOnlyPerChannelScheme

Scheme for UINT4 weight-only per-channel quantization.

class quark.torch.quantization.config.template.Int8Scheme

Scheme for INT8 weight and activation input quantization.

class quark.torch.quantization.config.template.FP8Scheme

Scheme for FP8 quantization (e4m3 format).

class quark.torch.quantization.config.template.MXFP4Scheme

Scheme for MXFP4 quantization.

class quark.torch.quantization.config.template.MXFP4WeightOnlyScheme

Scheme for weight-only MXFP4 quantization (e.g. gpt-oss source format).

class quark.torch.quantization.config.template.MXFP6E3M2Scheme

Scheme for MXFP6E3M2 quantization.

class quark.torch.quantization.config.template.MXFP6E2M3Scheme

Scheme for MXFP6E2M3 quantization.

class quark.torch.quantization.config.template.MXFP4_MXFP6E2M3Scheme

Scheme for MXFP4 weight and MXFP6E2M3 activation input quantization.

class quark.torch.quantization.config.template.AmdFP4Scheme(group_size: int = 16)

Scheme for amdfp4 quantization with E5M3 scale format.

Supports only group_size=16 or group_size=32.

class quark.torch.quantization.config.template.MX6Scheme

Scheme for MX6 quantization.

class quark.torch.quantization.config.template.BFP16Scheme

Scheme for BFP16 quantization.

class quark.torch.quantization.config.template.MXFP4_FP8Scheme

Scheme for MXFP4 weight and FP8 activation input quantization.

class quark.torch.quantization.config.template.PTPCFP8Scheme

Scheme for PTPC (per-token, per-channel) FP8 quantization.

Weights use static per-channel FP8 quantization; activations use dynamic per-token FP8 quantization.
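To make the per-channel versus per-token distinction concrete, here is a minimal plain-Python sketch of how the two kinds of scales could be computed. It is not Quark's implementation: the function names, the absmax scaling rule, and the sample tensors are assumptions, and the constant 448 is the largest finite value of the common FP8 E4M3 variant.

```python
# Illustrative only: plain-Python PTPC-style scaling, not Quark's implementation.
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def per_channel_scales(weight):
    """Static scales: one per output channel (row), computed once from the weights."""
    return [max(abs(v) for v in row) / FP8_E4M3_MAX for row in weight]


def per_token_scales(activations):
    """Dynamic scales: one per token (row), recomputed for every incoming batch."""
    return [max(abs(v) for v in token) / FP8_E4M3_MAX for token in activations]


weight = [[0.5, -2.0, 1.0], [4.0, 0.25, -0.125]]    # 2 output channels x 3 inputs
activations = [[1.0, -3.5, 0.5], [0.1, 0.2, -0.7]]  # 2 tokens x 3 features

w_scales = per_channel_scales(weight)     # fixed after calibration
a_scales = per_token_scales(activations)  # depends on the current input
```

The weight scales can be stored with the checkpoint, whereas the activation scales only exist at inference time because they depend on each input.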

class quark.torch.quantization.config.template.FP4Block16ScaleE4M3Scheme

Scheme for FP4 per-group quantization with FP8 E4M3 scale quantization for both weights and activations.

Uses FP4 per-group (group_size=16) with FP8 E4M3 per-tensor scale quantization. This is a two-stage quantization where the scale itself is quantized to FP8 E4M3 format. Weights use static quantization while activations use dynamic quantization.
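The two-stage layout described above can be sketched in plain Python as follows. This is an assumption-laden illustration rather than Quark's code: the function name two_level_scales and the absmax rule are hypothetical, the final rounding of each relative scale to FP8 is deliberately omitted, and an all-zero tensor is not handled.

```python
FP4_E2M1_MAX = 6.0    # largest magnitude representable in FP4 (E2M1)
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def two_level_scales(values, group_size=16):
    """Return (global_scale, per-group scales expressed relative to global_scale)."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    # Stage 1: one high-precision scale per group, mapping the group's max to the FP4 max.
    group_scales = [max(abs(v) for v in g) / FP4_E2M1_MAX for g in groups]
    # Stage 2: one per-tensor scale, mapping the largest group scale to the FP8 max.
    global_scale = max(group_scales) / FP8_E4M3_MAX
    # These ratios are what would then be rounded to FP8 (rounding omitted here).
    relative = [s / global_scale for s in group_scales]
    return global_scale, relative


values = [float(i) for i in range(1, 33)]  # 32 values -> two groups of 16
global_scale, relative = two_level_scales(values)
```

Each group's effective scale is then the product of its stored low-precision scale and the single per-tensor scale, which is why only one high-precision number is needed per tensor.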

class quark.torch.quantization.config.template.AmdFP4GlobalScaleScheme(group_size: int)

Scheme for FP4 per-group quantization with FP8 E5M3 global scale quantization for both weights and activations.

Uses FP4 per-group with FP8 E5M3 per-tensor global scale quantization. This is a two-stage quantization where the scale itself is quantized to FP8 E5M3 format. Weights use static quantization while activations use dynamic quantization.

class quark.torch.quantization.config.template.INT4_FP8Scheme

Scheme with INT4 weights and FP8 activations (a.k.a. “W4A8”).

The scheme name follows the <weight_format>_<activation_format> convention, the same style as mxfp4_fp8. Concretely:

Weight (4-bit, INT4):
  • Quantized to INT4 (4-bit signed integer), the final stored weight format.

  • Quantization is progressive (two stages): the high-precision weight is first quantized to FP8 E4M3 per-tensor, then that result is re-quantized to INT4.

  • The INT4 stage is per-channel (ch_axis=0), symmetric, and static (scales are fixed ahead of time rather than computed at runtime), using min-max observation, half-even rounding, and a float32 scale.

Activation (8-bit, FP8):
  • Quantized to FP8 E4M3 (8-bit floating point, 4 exponent / 3 mantissa bits).

  • Per-tensor, dynamic (scale computed at runtime from each input), min-max, float32 scale.

This matches the AMD-Quark INT4-weight / FP8-activation recipe used by models such as amd/Kimi-K2-Thinking-W4A8.
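The INT4 weight stage can be illustrated with a small plain-Python sketch of symmetric, per-channel, min-max quantization with half-even rounding. It covers only that stage: the preceding FP8 E4M3 pass and the activation path are left out, and the function name and sample weights are assumptions rather than Quark code.

```python
INT4_MIN, INT4_MAX = -8, 7


def quantize_int4_per_channel(weight):
    """Symmetric per-channel (row-wise, ch_axis=0) INT4 quantization with min-max scales."""
    scales, q_rows = [], []
    for row in weight:
        absmax = max(abs(v) for v in row)
        scale = absmax / INT4_MAX if absmax > 0 else 1.0
        # Python's round() rounds halves to the nearest even integer (half-even).
        q = [min(INT4_MAX, max(INT4_MIN, round(v / scale))) for v in row]
        scales.append(scale)
        q_rows.append(q)
    return q_rows, scales


# Values chosen so that the scales are exact and two entries land on a .5 tie.
weight = [[3.5, -1.75, 0.5], [-7.0, 2.5, 1.0]]
q_rows, scales = quantize_int4_per_channel(weight)
dequantized = [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

In this example -1.75 / 0.5 = -3.5 rounds to -4 and 2.5 / 1.0 rounds to 2, which is the half-even behaviour: ties go to the even neighbour rather than always away from zero.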

class quark.torch.quantization.config.template.QuantizationSchemeCollection

Collection for quantization schemes.

register_scheme(scheme_name: str, scheme: QuantizationScheme) -> None

Register a quantization scheme.

unregister_scheme(scheme_name: str) -> None

Unregister a quantization scheme.

get_supported_schemes() -> list[str]

Get list of supported quantization schemes.

get_scheme(scheme_name: str) -> QuantizationScheme

Get a quantization scheme by name.

class quark.torch.quantization.config.template.LLMTemplate(model_type: str, kv_layers_name: list[str] | None = None, q_layer_name: str | list[str] | None = None, gate_up_layers_name: list[str] | None = None, exclude_layers_name: list[str] = [], algorithm_configs: dict[str, AlgoConfig | None] | None = None, f2f_weight_converters: list[WeightConverter] | None = None, **legacy_algorithm_parameters: AlgoConfig | None)

A configuration template that defines how to quantize specific types of LLM models.

Each LLM architecture (like llama, qwen, deepseek, etc.) has its own unique structure and naming patterns for layers. This template allows specifying those patterns and quantization settings in a reusable way.

Parameters:
  • model_type (str) – Type of the LLM model.

  • kv_layers_name (List[str]) – List of k_proj and v_proj layer name patterns to match. Default is None.

  • q_layer_name (Union[str, List[str]]) – q_proj layer name pattern to match. Default is None.

  • exclude_layers_name (List[str]) – List of layer name patterns to exclude from quantization. Default is [].

  • algorithm_configs (Optional[Dict[str, AlgoConfig]]) – Dictionary of algorithm names to algorithm configurations. Example: {"awq": custom_awq_config, "gptq": custom_gptq_config}. Default is None.

  • legacy_algorithm_parameters (Dict[str, AlgoConfig]) – Legacy keyword arguments in <algorithm>_config form (for backward compatibility). Passing these will emit a deprecation warning. Use algorithm_configs for new code.
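The layer name patterns used in this documentation (for example "*k_proj" or "lm_head") read like shell-style globs. Assuming that interpretation, the following sketch uses Python's standard fnmatch module to show which layers such patterns would select. The layer names and the select helper are hypothetical, and Quark's actual matching rules may differ.

```python
from fnmatch import fnmatch

# Hypothetical module names, shaped like those in a decoder-only transformer.
layer_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.down_proj",
    "lm_head",
]


def select(names, patterns):
    """Return the names matching at least one glob-style pattern."""
    return [n for n in names if any(fnmatch(n, p) for p in patterns)]


kv_layers = select(layer_names, ["*k_proj", "*v_proj"])  # candidates for kv_layers_name
excluded = select(layer_names, ["lm_head"])              # candidates for exclude_layers_name
```

Running a check like this against the names from model.named_modules() is a quick way to confirm that a pattern matches the intended layers before building a template.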

Note:
  • The quantization schemes supported by the template are:
    • fp8

    • ptpc_fp8

    • int4_wo_32

    • int4_wo_64

    • int4_wo_128

    • int4_wo_per_channel

    • uint4_wo_32

    • uint4_wo_64

    • uint4_wo_128

    • uint4_wo_per_channel

    • int8

    • mxfp4

    • mxfp6_e3m2

    • mxfp6_e2m3

    • mx6

    • bfp16

    • int4_fp8

  • The quantization algorithms supported by the template are:
    • awq

    • gptq

    • gptaq

    • smoothquant

    • autosmoothquant

    • qronos

    • rotation

  • The KV cache schemes supported by the template are:
    • fp8

  • The attention schemes supported by the template are:
    • fp8

Creating a Custom Template:

To create a custom template for a new model type, you can define layer name patterns and algorithm configurations specific to your model architecture. Take moonshotai/Kimi-K2-Instruct as an example:

from quark.torch import LLMTemplate

# Create a new template
template = LLMTemplate(
    model_type="kimi_k2",
    kv_layers_name=["*kv_b_proj"],
    exclude_layers_name=["lm_head"]
)

# Register the template to LLMTemplate class (optional, if you want to use the template in other places)
LLMTemplate.register_template(template)

classmethod list_available() -> list[str]

List all available model names of registered templates.

Returns:

List of template names.

Return type:

List[str]

Example:

from quark.torch import LLMTemplate

templates = LLMTemplate.list_available()
print(templates)  # ['llama', 'opt', 'gpt2', ...]

classmethod register_template(template: LLMTemplate) -> None

Register a template.

Parameters:

template (LLMTemplate) – The template to register.

Example:

from quark.torch import LLMTemplate

# Create template
template = LLMTemplate(
    model_type="llama",
    kv_layers_name=["*k_proj", "*v_proj"],
    q_layer_name="*q_proj",
    exclude_layers_name=["lm_head"],
)

# Register template
LLMTemplate.register_template(template)

classmethod get(model_type: str) -> LLMTemplate

Get a template by model type.

Parameters:

model_type (str) – Type of the model, taken from the model.config.model_type attribute of the original Hugging Face model. When model_type is not defined, model.config.architectures[0] is used instead.

Available model types:

  • chatglm

  • cohere

  • dbrx

  • deepseek

  • deepseek_v2

  • deepseek_v3

  • deepseek_v32

  • deepseek_v4

  • deepseek_vl_v2

  • gemma2

  • gemma3

  • gemma3_text

  • glm4_moe

  • glm4_moe_lite

  • glm_moe_dsa

  • gptj

  • gpt_oss

  • granitemoehybrid

  • grok-1

  • instella

  • kimi_k2

  • kimi_k25

  • llama

  • llama4

  • minimax_m2

  • minimax_m3_vl

  • mistral

  • mixtral

  • mllama

  • olmo

  • opt

  • phi

  • phi3

  • qwen

  • qwen2

  • qwen2_moe

  • qwen3

  • qwen3_moe

  • qwen3_next

  • qwen3_vl_moe

  • qwen3_5_moe

Returns:

The template object.

Return type:

LLMTemplate

Example:

from quark.torch import LLMTemplate

template = LLMTemplate.get("llama")
print(template)
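
The model_type lookup described for this method (use config.model_type, otherwise fall back to the first entry of config.architectures) can be sketched without loading a model. The resolve_model_type helper and the stand-in config objects are hypothetical and only illustrate the fallback order.

```python
from types import SimpleNamespace


def resolve_model_type(config):
    """Prefer config.model_type; otherwise fall back to the first architecture name."""
    model_type = getattr(config, "model_type", None)
    if model_type:
        return model_type
    return config.architectures[0]


# Stand-in objects; a real config would come from a loaded Hugging Face model.
with_type = SimpleNamespace(model_type="llama", architectures=["LlamaForCausalLM"])
without_type = SimpleNamespace(architectures=["MyCustomForCausalLM"])

resolved_a = resolve_model_type(with_type)
resolved_b = resolve_model_type(without_type)
```

The resulting string is what would be passed to LLMTemplate.get, so it has to match one of the registered model types.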

classmethod register_scheme(scheme_name: str, config: QLayerConfig) -> None

Register a new quantization scheme for LLMTemplate class.

Parameters:
  • scheme_name (str) – Name of the scheme.

  • config (QLayerConfig) – Configuration for the scheme.

Example:

# Register a new quantization scheme, "int8_wo" (INT8 weight-only), with the template
from quark.torch import LLMTemplate
from quark.torch.quantization.config.config import Int8PerTensorSpec, QLayerConfig

quant_spec = Int8PerTensorSpec(observer_method="min_max", symmetric=True, scale_type="float",
                               round_method="half_even", is_dynamic=False).to_quantization_spec()
global_config = QLayerConfig(weight=quant_spec)

LLMTemplate.register_scheme("int8_wo", config=global_config)

classmethod unregister_scheme(scheme_name: str) -> None

Unregister a quantization scheme.

Parameters:

scheme_name (str) – Name of the scheme to unregister.

Example:

from quark.torch import LLMTemplate

LLMTemplate.unregister_scheme("int8")

classmethod get_supported_schemes() -> list[str]

Get list of supported quantization schemes.

get_config(scheme: str, algorithm: str | list[str] | None = None, kv_cache_scheme: str | None = None, min_kv_scale: float = 0.0, attention_scheme: str | None = None, layer_config: dict[str, str] | None = None, layer_type_config: dict[type[Module], str] | None = None, exclude_layers: list[str] | None = None, algo_configs: dict[str, AlgoConfig] | None = None, shared_scale_groups: list[list[str]] | None = None) -> QConfig

Create a quantization configuration based on the provided parameters.

Parameters:
  • scheme (str) – Name of the quantization scheme.

  • algorithm (Optional[Union[str, List[str]]]) – Name or list of names of quantization algorithms to apply.

  • kv_cache_scheme (Optional[str]) – Name of the KV cache quantization scheme.

  • min_kv_scale (float) – Minimum value of KV Cache scale.

  • attention_scheme (Optional[str]) – Name of the attention quantization scheme.

  • layer_config (Optional[Dict[str, str]]) – Dictionary of layer name patterns and quantization scheme names.

  • layer_type_config (Optional[Dict[Type[nn.Module], str]]) – Dictionary of layer types and quantization scheme names.

  • exclude_layers (Optional[List[str]]) – List of layer names to exclude from quantization.

  • algo_configs (Optional[Dict[str, AlgoConfig]]) – Dictionary of algorithm names to their configurations.

  • shared_scale_groups (Optional[List[List[str]]]) – Groups of layer name suffixes that should share the global-scale observer. Each inner list represents a group of parallel layer suffixes (e.g. ["q_proj", "k_proj", "v_proj"]). If None, the default for the scheme is used. Pass [] to disable.

Example:

from quark.torch import LLMTemplate

template = LLMTemplate.get("llama")
config = template.get_config(scheme="fp8", kv_cache_scheme="fp8")
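
To give a feel for what a shared_scale_groups value expresses, the sketch below groups layer names by suffix within the same parent module. It is a guess at the intent for illustration only: the helper name, the layer names, and the rule that only sibling layers are grouped together are assumptions, not Quark's documented behaviour.

```python
from collections import defaultdict


def group_layers_by_suffix(layer_names, shared_scale_groups):
    """Map (parent module, group index) -> layers that would share one scale."""
    shared = defaultdict(list)
    for name in layer_names:
        parent, _, suffix = name.rpartition(".")
        for index, group in enumerate(shared_scale_groups):
            if suffix in group:
                shared[(parent, index)].append(name)
    return dict(shared)


# Hypothetical layer names for a single decoder block.
layer_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.mlp.down_proj",
]
groups = group_layers_by_suffix(
    layer_names, [["q_proj", "k_proj", "v_proj"], ["gate_proj", "up_proj"]]
)
disabled = group_layers_by_suffix(layer_names, [])  # empty list: nothing is shared
```

Here the three attention projections form one group and the gate and up projections form another, while down_proj keeps its own scale because it appears in no group.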