Quantization LLM Template#

class quark.torch.quantization.config.template.QuantizationScheme(config: QuantizationConfig)[source]#

Abstract base class for quantization schemes.

class quark.torch.quantization.config.template.Int4WeightOnlyScheme(group_size: int)[source]#

Scheme for INT4 weight-only quantization.

class quark.torch.quantization.config.template.Uint4WeightOnlyScheme(group_size: int)[source]#

Scheme for UINT4 weight-only quantization.
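
A minimal usage sketch, assuming these schemes can be instantiated directly from this module with the constructor signatures shown above; the group sizes below are illustrative values that mirror the built-in scheme names listed later on this page.

from quark.torch.quantization.config.template import (
    Int4WeightOnlyScheme,
    Uint4WeightOnlyScheme,
)

# Group sizes are illustrative; 32, 64, and 128 mirror the built-in
# int4_wo_* / uint4_wo_* scheme names documented further below.
int4_scheme = Int4WeightOnlyScheme(group_size=128)
uint4_scheme = Uint4WeightOnlyScheme(group_size=64)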

class quark.torch.quantization.config.template.Int8Scheme[source]#

Scheme for INT8 weight and activation input quantization.

class quark.torch.quantization.config.template.FP8Scheme[source]#

Scheme for FP8 quantization (e4m3 format).

class quark.torch.quantization.config.template.MXFP4Scheme[source]#

Scheme for MXFP4 quantization.

class quark.torch.quantization.config.template.MXFP6E3M2Scheme[source]#

Scheme for MXFP6E3M2 quantization.

class quark.torch.quantization.config.template.MXFP6E2M3Scheme[source]#

Scheme for MXFP6E2M3 quantization.

class quark.torch.quantization.config.template.MX6Scheme[source]#

Scheme for MX6 quantization.

class quark.torch.quantization.config.template.BFP16Scheme[source]#

Scheme for BFP16 quantization.

class quark.torch.quantization.config.template.QuantizationSchemeCollection[source]#

A collection of quantization schemes, registered and looked up by name.

register_scheme(scheme_name: str, scheme: QuantizationScheme) → None[source]#

Register a quantization scheme.

unregister_scheme(scheme_name: str) → None[source]#

Unregister a quantization scheme.

get_supported_schemes() → list[str][source]#

Get list of supported quantization schemes.

get_scheme(scheme_name: str) → QuantizationScheme[source]#

Get a quantization scheme by name.
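
A hedged sketch of how the collection methods above might fit together, assuming QuantizationSchemeCollection can be constructed without arguments; the scheme name "int4_wo_256" and its group size are illustrative, not names shipped with the library.

from quark.torch.quantization.config.template import (
    Int4WeightOnlyScheme,
    QuantizationSchemeCollection,
)

collection = QuantizationSchemeCollection()

# Register a scheme under an illustrative (non-built-in) name.
collection.register_scheme("int4_wo_256", Int4WeightOnlyScheme(group_size=256))

print(collection.get_supported_schemes())     # names of all registered schemes
scheme = collection.get_scheme("int4_wo_256")  # look a scheme up by name

# Remove it again once it is no longer needed.
collection.unregister_scheme("int4_wo_256")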

class quark.torch.quantization.config.template.LLMTemplate(model_type: str, kv_layers_name: list[str] | None = None, q_layer_name: str | list[str] | None = None, exclude_layers_name: list[str] = [], awq_config: AWQConfig | None = None, gptq_config: GPTQConfig | None = None, smoothquant_config: SmoothQuantConfig | None = None, autosmoothquant_config: AutoSmoothQuantConfig | None = None, rotation_config: RotationConfig | None = None)[source]#

A configuration template that defines how to quantize specific types of LLM models.

Each LLM architecture (like llama, qwen, deepseek, etc.) has its own unique structure and naming patterns for layers. This template allows specifying those patterns and quantization settings in a reusable way.

Parameters:
  • model_type (str) – Type of the LLM model.

  • kv_layers_name (List[str]) – List of k_proj and v_proj layer name patterns to match. Default is None.

  • q_layer_name (Union[str, List[str]]) – q_proj layer name pattern (or list of patterns) to match. Default is None.

  • exclude_layers_name (List[str]) – List of layer name patterns to exclude from quantization. Default is [].

  • awq_config (AWQConfig) – Configuration for AWQ algorithm. Default is None.

  • gptq_config (GPTQConfig) – Configuration for GPTQ algorithm. Default is None.

  • smoothquant_config (SmoothQuantConfig) – Configuration for SmoothQuant algorithm. Default is None.

  • autosmoothquant_config (AutoSmoothQuantConfig) – Configuration for AutoSmoothQuant algorithm. Default is None.

  • rotation_config (RotationConfig) – Configuration for Rotation algorithm. Default is None.

Note:
  • The quantization schemes supported by the template are:
    • fp8

    • int4_wo_32

    • int4_wo_64

    • int4_wo_128

    • uint4_wo_32

    • uint4_wo_64

    • uint4_wo_128

    • int8

    • mxfp4

    • mxfp6_e3m2

    • mxfp6_e2m3

    • mx6

    • bfp16

  • The quantization algorithms supported by the template are:
    • awq

    • gptq

    • smoothquant

    • autosmoothquant

    • rotation

  • The KV cache schemes supported by the template are:
    • fp8

  • The attention schemes supported by the template are:
    • fp8

Creating a Custom Template:

To create a custom template for a new model type, you can define layer name patterns and algorithm configurations specific to your model architecture. Take moonshotai/Kimi-K2-Instruct as an example:

from quark.torch import LLMTemplate

# Create a new template
template = LLMTemplate(
    model_type="kimi_k2",
    kv_layers_name=["*kv_b_proj"],
    exclude_layers_name=["lm_head"]
)

# Register the template with the LLMTemplate class (optional; only needed if you want to reuse the template elsewhere)
LLMTemplate.register_template(template)

classmethod list_available() → list[str][source]#

List all available model names of registered templates.

Returns:

List of template names.

Return type:

List[str]

Example:

from quark.torch import LLMTemplate

templates = LLMTemplate.list_available()
print(templates)  # ['llama', 'opt', 'gpt2', ...]

classmethod register_template(template: LLMTemplate) → None[source]#

Register a template.

Parameters:

template (LLMTemplate) – The template to register.

Example:

from quark.torch import LLMTemplate

# Create template
template = LLMTemplate(
    model_type="llama",
    kv_layers_name=["*k_proj", "*v_proj"],
    q_layer_name="*q_proj"
)

# Register template
LLMTemplate.register_template(template)

classmethod get(model_type: str) → LLMTemplate[source]#

Get a template by model type.

Parameters:

model_type (str) – Type of the model. It is obtained from the original LLM HuggingFace model’s model.config.model_type attribute. When the model_type field is not defined, model.config.architectures[0] is used as the model_type instead.

Available model types:

  • llama

  • mllama

  • llama4

  • opt

  • qwen2_moe

  • qwen2

  • qwen

  • chatglm

  • phi3

  • phi

  • mistral

  • mixtral

  • gptj

  • grok-1

  • cohere

  • dbrx

  • deepseek_v2

  • deepseek_v3

  • deepseek

  • olmo

  • gemma2

  • gemma3_text

  • gemma3

  • instella

  • gpt_oss

Returns:

The template object.

Return type:

LLMTemplate

Example:

from quark.torch import LLMTemplate

template = LLMTemplate.get("llama")
print(template)

classmethod register_scheme(scheme_name: str, config: QuantizationConfig) → None[source]#

Register a new quantization scheme for LLMTemplate class.

Parameters:
  • scheme_name (str) – Name of the scheme.

  • config (QuantizationConfig) – Configuration for the scheme.

Example:

# Register a new quantization scheme ``int8_wo (int8 weight-only)`` to the template
from quark.torch import LLMTemplate
from quark.torch.quantization.config.config import Int8PerTensorSpec, QuantizationConfig

quant_spec = Int8PerTensorSpec(observer_method="min_max", symmetric=True, scale_type="float",
                               round_method="half_even", is_dynamic=False).to_quantization_spec()
global_config = QuantizationConfig(weight=quant_spec)

LLMTemplate.register_scheme("int8_wo", config=global_config)

classmethod unregister_scheme(scheme_name: str) → None[source]#

Unregister a quantization scheme.

Parameters:

scheme_name (str) – Name of the scheme to unregister.

Example:

from quark.torch import LLMTemplate

LLMTemplate.unregister_scheme("int8")

get_config(scheme: str, algorithm: str | list[str] | None = None, kv_cache_scheme: str | None = None, min_kv_scale: float = 0.0, attention_scheme: str | None = None, layer_config: dict[str, str] | None = None, layer_type_config: dict[type[Module], str] | None = None, exclude_layers: list[str] | None = None) → Config[source]#

Create a quantization configuration based on the provided parameters.

Parameters:
  • scheme (str) – Name of the quantization scheme.

  • algorithm (Optional[Union[str, List[str]]]) – Name or list of names of quantization algorithms to apply.

  • kv_cache_scheme (Optional[str]) – Name of the KV cache quantization scheme.

  • min_kv_scale (float) – Minimum value of KV Cache scale.

  • attention_scheme (Optional[str]) – Name of the attention quantization scheme.

  • layer_config (Optional[Dict[str, str]]) – Dictionary of layer name patterns and quantization scheme names.

  • layer_type_config (Optional[Dict[Type[nn.Module], str]]) – Dictionary of layer types and quantization scheme names.

  • exclude_layers (Optional[List[str]]) – List of layer names to exclude from quantization.

Example:

from quark.torch import LLMTemplate

template = LLMTemplate.get("llama")
config = template.get_config(scheme="fp8", kv_cache_scheme="fp8")
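
The minimal call above can be extended with the algorithm, KV cache, per-layer, and exclusion parameters from the signature. The combination below is only a sketch: the scheme and algorithm names come from the supported lists earlier on this page, while the "*down_proj" and "lm_head" patterns are assumptions about a typical Llama checkpoint rather than values mandated by the template.

from quark.torch import LLMTemplate

template = LLMTemplate.get("llama")

# INT4 weight-only (group size 128) with AWQ, an FP8 KV cache,
# an FP8 override for the down_proj layers, and lm_head excluded.
config = template.get_config(
    scheme="int4_wo_128",
    algorithm="awq",
    kv_cache_scheme="fp8",
    layer_config={"*down_proj": "fp8"},
    exclude_layers=["lm_head"],
)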