Quantization LLM Template#

class quark.torch.quantization.config.template.QuantizationScheme(config: QLayerConfig)[source]#

Abstract base class for quantization schemes.

class quark.torch.quantization.config.template.Int4WeightOnlyScheme(group_size: int)[source]#

Scheme for INT4 weight-only quantization.
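The idea behind group-wise INT4 weight-only quantization can be sketched in a few lines. This is an illustrative toy example, not Quark's implementation: each group of `group_size` weights shares one scale, chosen so the largest magnitude in the group maps to the INT4 limit.

```python
# Illustrative sketch (not Quark's implementation) of symmetric INT4
# weight-only quantization with one shared scale per group of weights.
def quantize_int4_grouped(weights, group_size):
    """Quantize a flat list of float weights to INT4 ([-8, 7]) per group."""
    quantized, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # One scale per group, chosen so the largest magnitude maps to 7.
        scale = max(abs(w) for w in group) / 7 or 1.0
        scales.append(scale)
        quantized.extend(max(-8, min(7, round(w / scale))) for w in group)
    return quantized, scales

q, s = quantize_int4_grouped([0.1, -0.4, 0.7, 0.2], group_size=2)
print(q)  # every value is an integer in [-8, 7]
```

Smaller group sizes (32 vs. 128) give each scale fewer weights to cover, which generally improves accuracy at the cost of more scale storage.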

class quark.torch.quantization.config.template.Int4WeightOnlyPerChannelScheme[source]#

Scheme for INT4 weight-only per-channel quantization.

class quark.torch.quantization.config.template.Uint4WeightOnlyScheme(group_size: int)[source]#

Scheme for UINT4 weight-only quantization.

class quark.torch.quantization.config.template.Uint4WeightOnlyPerChannelScheme[source]#

Scheme for UINT4 weight-only per-channel quantization.

class quark.torch.quantization.config.template.Int8Scheme[source]#

Scheme for INT8 weight and activation input quantization.

class quark.torch.quantization.config.template.FP8Scheme[source]#

Scheme for FP8 quantization (e4m3 format).
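The e4m3 format packs 1 sign bit, 4 exponent bits (bias 7), and 3 mantissa bits. Its dynamic range follows directly from the layout; this is a generic property of the OCP FP8 E4M3 format, not Quark-specific code:

```python
# E4M3 layout: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
# The largest finite value uses exponent 1111 and mantissa 110
# (mantissa 111 with exponent 1111 encodes NaN).
bias = 7
exponent = 0b1111 - bias        # 8
mantissa = 1 + 0b110 / 8        # 1.75
print(mantissa * 2 ** exponent) # 448.0
```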

class quark.torch.quantization.config.template.MXFP4Scheme[source]#

Scheme for MXFP4 quantization.

class quark.torch.quantization.config.template.MXFP6E3M2Scheme[source]#

Scheme for MXFP6E3M2 quantization.

class quark.torch.quantization.config.template.MXFP6E2M3Scheme[source]#

Scheme for MXFP6E2M3 quantization.

class quark.torch.quantization.config.template.MXFP4_MXFP6E2M3Scheme[source]#

Scheme for MXFP4 weight and MXFP6E2M3 activation input quantization.

class quark.torch.quantization.config.template.MX6Scheme[source]#

Scheme for MX6 quantization.

class quark.torch.quantization.config.template.BFP16Scheme[source]#

Scheme for BFP16 quantization.

class quark.torch.quantization.config.template.MXFP4_FP8Scheme[source]#

Scheme for MXFP4 weight and FP8 activation input quantization.

class quark.torch.quantization.config.template.PTPCFP8Scheme[source]#

Scheme for PTPC (per-token, per-channel) FP8 quantization: weights use static per-channel FP8 and activations use dynamic per-token FP8.

class quark.torch.quantization.config.template.QuantizationSchemeCollection[source]#

A name-keyed registry of quantization schemes.

register_scheme(scheme_name: str, scheme: QuantizationScheme) → None[source]#

Register a quantization scheme.

unregister_scheme(scheme_name: str) → None[source]#

Unregister a quantization scheme.

get_supported_schemes() → list[str][source]#

Get list of supported quantization schemes.

get_scheme(scheme_name: str) → QuantizationScheme[source]#

Get a quantization scheme by name.
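The four methods above form a simple register/unregister/get contract. A minimal plain-dict stand-in (hypothetical, not the actual Quark class) shows the behavior:

```python
# Minimal stand-in illustrating the register/unregister/get contract
# of QuantizationSchemeCollection (not the actual Quark class).
class SchemeRegistry:
    def __init__(self):
        self._schemes = {}

    def register_scheme(self, scheme_name, scheme):
        self._schemes[scheme_name] = scheme

    def unregister_scheme(self, scheme_name):
        self._schemes.pop(scheme_name, None)

    def get_supported_schemes(self):
        return list(self._schemes)

    def get_scheme(self, scheme_name):
        if scheme_name not in self._schemes:
            raise KeyError(f"Unknown scheme: {scheme_name}")
        return self._schemes[scheme_name]

registry = SchemeRegistry()
registry.register_scheme("fp8", object())  # placeholder scheme object
print(registry.get_supported_schemes())    # ['fp8']
```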

class quark.torch.quantization.config.template.LLMTemplate(model_type: str, kv_layers_name: list[str] | None = None, q_layer_name: str | list[str] | None = None, exclude_layers_name: list[str] = [], awq_config: AWQConfig | None = None, gptq_config: GPTQConfig | None = None, gptaq_config: GPTAQConfig | None = None, qronos_config: QronosConfig | None = None, smoothquant_config: SmoothQuantConfig | None = None, autosmoothquant_config: AutoSmoothQuantConfig | None = None, rotation_config: RotationConfig | None = None)[source]#

A configuration template that defines how to quantize specific types of LLM models.

Each LLM architecture (like llama, qwen, deepseek, etc.) has its own unique structure and naming patterns for layers. This template allows specifying those patterns and quantization settings in a reusable way.

Parameters:
  • model_type (str) – Type of the LLM model.

  • kv_layers_name (List[str]) – List of k_proj and v_proj layer name patterns to match. Default is None.

  • q_layer_name (Union[str, List[str]]) – q_proj layer name pattern to match. Default is None.

  • exclude_layers_name (List[str]) – List of layer name patterns to exclude from quantization. Default is [].

  • awq_config (AWQConfig) – Configuration for AWQ algorithm. Default is None.

  • gptq_config (GPTQConfig) – Configuration for GPTQ algorithm. Default is None.

  • gptaq_config (GPTAQConfig) – Configuration for GPTAQ algorithm. Default is None.

  • qronos_config (QronosConfig) – Configuration for Qronos algorithm. Default is None.

  • smoothquant_config (SmoothQuantConfig) – Configuration for SmoothQuant algorithm. Default is None.

  • autosmoothquant_config (AutoSmoothQuantConfig) – Configuration for AutoSmoothQuant algorithm. Default is None.

  • rotation_config (RotationConfig) – Configuration for Rotation algorithm. Default is None.

Note:
  • The quantization schemes supported by the template are:
    • fp8

    • ptpc_fp8

    • int4_wo_32

    • int4_wo_64

    • int4_wo_128

    • int4_wo_per_channel

    • uint4_wo_32

    • uint4_wo_64

    • uint4_wo_128

    • uint4_wo_per_channel

    • int8

    • mxfp4

    • mxfp6_e3m2

    • mxfp6_e2m3

    • mx6

    • bfp16

  • The quantization algorithms supported by the template are:
    • awq

    • gptq

    • gptaq

    • qronos

    • smoothquant

    • autosmoothquant

    • rotation

  • The KV cache schemes supported by the template are:
    • fp8

  • The attention schemes supported by the template are:
    • fp8
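The group-wise weight-only scheme names listed above follow a "<dtype>_wo_<group_size>" convention, with a "_per_channel" suffix for the per-channel variants; a quick illustration:

```python
# Weight-only scheme names follow "<dtype>_wo_<group_size>" for group-wise
# quantization and "<dtype>_wo_per_channel" for the per-channel variant.
group_sizes = [32, 64, 128]
int4_schemes = [f"int4_wo_{g}" for g in group_sizes] + ["int4_wo_per_channel"]
print(int4_schemes)
# ['int4_wo_32', 'int4_wo_64', 'int4_wo_128', 'int4_wo_per_channel']
```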

Creating a Custom Template:

To create a custom template for a new model type, you can define layer name patterns and algorithm configurations specific to your model architecture. Take moonshotai/Kimi-K2-Instruct as an example:

from quark.torch import LLMTemplate

# Create a new template
template = LLMTemplate(
    model_type="kimi_k2",
    kv_layers_name=["*kv_b_proj"],
    exclude_layers_name=["lm_head"]
)

# Register the template with the LLMTemplate class (optional; only needed to reuse the template elsewhere)
LLMTemplate.register_template(template)

classmethod list_available() → list[str][source]#

List the model types of all registered templates.

Returns:

List of template names.

Return type:

List[str]

Example:

from quark.torch import LLMTemplate

templates = LLMTemplate.list_available()
print(templates)  # ['llama', 'opt', 'gpt2', ...]

classmethod register_template(template: LLMTemplate) → None[source]#

Register a template.

Parameters:

template (LLMTemplate) – The template to register.

Example:

from quark.torch import LLMTemplate

# Create template
template = LLMTemplate(
    model_type="llama",
    kv_layers_name=["*k_proj", "*v_proj"],
    q_layer_name="*q_proj",
    exclude_layers_name=["lm_head"],
)

# Register template
LLMTemplate.register_template(template)

classmethod get(model_type: str) → LLMTemplate[source]#

Get a template by model type.

Parameters:

model_type (str) – Type of the model, obtained from the original HuggingFace LLM's model.config.model_type attribute. When the model_type field is not defined, model.config.architectures[0] is used as the model_type instead.

Available model types:

  • chatglm

  • cohere

  • dbrx

  • deepseek

  • deepseek_v2

  • deepseek_v3

  • gemma2

  • gemma3

  • gemma3_text

  • gptj

  • gpt_oss

  • grok-1

  • instella

  • llama

  • llama4

  • mistral

  • mixtral

  • mllama

  • olmo

  • opt

  • phi

  • phi3

  • qwen

  • qwen2

  • qwen2_moe

  • qwen3

  • qwen3_moe

  • qwen3_vl_moe

Returns:

The template object.

Return type:

LLMTemplate

Example:

from quark.torch import LLMTemplate

template = LLMTemplate.get("llama")
print(template)

classmethod register_scheme(scheme_name: str, config: QLayerConfig) → None[source]#

Register a new quantization scheme for LLMTemplate class.

Parameters:
  • scheme_name (str) – Name of the scheme.

  • config (QLayerConfig) – Configuration for the scheme.

Example:

# Register a new quantization scheme "int8_wo" (INT8 weight-only) with the template
from quark.torch import LLMTemplate
from quark.torch.quantization.config.config import Int8PerTensorSpec, QLayerConfig

quant_spec = Int8PerTensorSpec(observer_method="min_max", symmetric=True, scale_type="float",
                               round_method="half_even", is_dynamic=False).to_quantization_spec()
global_config = QLayerConfig(weight=quant_spec)

LLMTemplate.register_scheme("int8_wo", config=global_config)

classmethod unregister_scheme(scheme_name: str) → None[source]#

Unregister a quantization scheme.

Parameters:

scheme_name (str) – Name of the scheme to unregister.

Example:

from quark.torch import LLMTemplate

LLMTemplate.unregister_scheme("int8")

classmethod get_supported_schemes() → list[str][source]#

Get list of supported quantization schemes.

get_config(scheme: str, algorithm: str | list[str] | None = None, kv_cache_scheme: str | None = None, min_kv_scale: float = 0.0, attention_scheme: str | None = None, layer_config: dict[str, str] | None = None, layer_type_config: dict[type[Module], str] | None = None, exclude_layers: list[str] | None = None, algo_configs: dict[str, AlgoConfig] | None = None) → Config[source]#

Create a quantization configuration based on the provided parameters.

Parameters:
  • scheme (str) – Name of the quantization scheme.

  • algorithm (Optional[Union[str, List[str]]]) – Name or list of names of quantization algorithms to apply.

  • kv_cache_scheme (Optional[str]) – Name of the KV cache quantization scheme.

  • min_kv_scale (float) – Minimum value of KV Cache scale.

  • attention_scheme (Optional[str]) – Name of the attention quantization scheme.

  • layer_config (Optional[Dict[str, str]]) – Dictionary of layer name patterns and quantization scheme names.

  • layer_type_config (Optional[Dict[Type[nn.Module], str]]) – Dictionary of layer types and quantization scheme names.

  • exclude_layers (Optional[List[str]]) – List of layer names to exclude from quantization.

  • algo_configs (Optional[Dict[str, AlgoConfig]]) – Dictionary of algorithm names to their configurations.

Example:

from quark.torch import LLMTemplate

template = LLMTemplate.get("llama")
config = template.get_config(scheme="fp8", kv_cache_scheme="fp8")
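The layer name patterns used throughout the template (e.g. "*k_proj", "lm_head") are glob-style. A self-contained sketch of how such matching could work; this is hypothetical and does not reflect Quark's internal code:

```python
# Hypothetical illustration of glob-style layer-name matching, as suggested
# by patterns like "*k_proj" and "lm_head" in the template parameters.
from fnmatch import fnmatch

layer_names = [
    "model.layers.0.self_attn.k_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.gate_proj",
    "lm_head",
]
kv_patterns = ["*k_proj", "*v_proj"]
exclude_patterns = ["lm_head"]

# Keep layers that match a KV pattern and are not excluded.
kv_layers = [
    name for name in layer_names
    if any(fnmatch(name, p) for p in kv_patterns)
    and not any(fnmatch(name, p) for p in exclude_patterns)
]
print(kv_layers)
```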