Quantization LLM Template#
- class quark.torch.quantization.config.template.QuantizationScheme(config: QuantizationConfig)[source]#
Abstract base class for quantization schemes.
- class quark.torch.quantization.config.template.Int4WeightOnlyScheme(group_size: int)[source]#
Scheme for INT4 weight-only quantization.
- class quark.torch.quantization.config.template.Uint4WeightOnlyScheme(group_size: int)[source]#
Scheme for UINT4 weight-only quantization.
- class quark.torch.quantization.config.template.Int8Scheme[source]#
Scheme for INT8 weight and activation input quantization.
- class quark.torch.quantization.config.template.FP8Scheme[source]#
Scheme for FP8 quantization (e4m3 format).
- class quark.torch.quantization.config.template.MXFP6E3M2Scheme[source]#
Scheme for MXFP6E3M2 quantization.
- class quark.torch.quantization.config.template.MXFP6E2M3Scheme[source]#
Scheme for MXFP6E2M3 quantization.
- class quark.torch.quantization.config.template.QuantizationSchemeCollection[source]#
Collection for quantization schemes.
- register_scheme(scheme_name: str, scheme: QuantizationScheme) → None[source]#
Register a quantization scheme.
- get_scheme(scheme_name: str) → QuantizationScheme[source]#
Get a quantization scheme by name.
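The collection is a name-keyed registry. The sketch below illustrates the documented `register_scheme`/`get_scheme` semantics with a plain dict; it is not Quark's implementation, and the `SchemeCollection` class and the dict-valued schemes are placeholders for illustration only.

```python
# Illustrative dict-backed registry mirroring the documented
# register_scheme / get_scheme semantics (NOT Quark's implementation).

class SchemeCollection:
    def __init__(self):
        self._schemes = {}

    def register_scheme(self, scheme_name, scheme):
        # Registering the same name again overwrites the earlier scheme.
        self._schemes[scheme_name] = scheme

    def get_scheme(self, scheme_name):
        # Fail loudly for unknown names instead of returning None.
        if scheme_name not in self._schemes:
            raise KeyError(f"Unknown quantization scheme: {scheme_name!r}")
        return self._schemes[scheme_name]


collection = SchemeCollection()
collection.register_scheme("int4_wo_32", {"dtype": "int4", "group_size": 32})
print(collection.get_scheme("int4_wo_32"))  # {'dtype': 'int4', 'group_size': 32}
```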
- class quark.torch.quantization.config.template.LLMTemplate(model_type: str, kv_layers_name: list[str] | None = None, q_layer_name: str | list[str] | None = None, exclude_layers_name: list[str] = [], awq_config: AWQConfig | None = None, gptq_config: GPTQConfig | None = None, smoothquant_config: SmoothQuantConfig | None = None, autosmoothquant_config: AutoSmoothQuantConfig | None = None, rotation_config: RotationConfig | None = None)[source]#
A configuration template that defines how to quantize specific types of LLM models.
Each LLM architecture (like llama, qwen, deepseek, etc.) has its own unique structure and naming patterns for layers. This template allows specifying those patterns and quantization settings in a reusable way.
- Parameters:
model_type (str) – Type of the LLM model.
kv_layers_name (List[str]) – List of k_proj and v_proj layer name patterns to match. Default is None.
q_layer_name (Union[str, List[str]]) – q_proj layer name pattern to match. Default is None.
exclude_layers_name (List[str]) – List of layer name patterns to exclude from quantization. Default is [].
awq_config (AWQConfig) – Configuration for AWQ algorithm. Default is None.
gptq_config (GPTQConfig) – Configuration for GPTQ algorithm. Default is None.
smoothquant_config (SmoothQuantConfig) – Configuration for SmoothQuant algorithm. Default is None.
autosmoothquant_config (AutoSmoothQuantConfig) – Configuration for AutoSmoothQuant algorithm. Default is None.
rotation_config (RotationConfig) – Configuration for Rotation algorithm. Default is None.
- Note:
- The quantization schemes supported by the template are:
fp8
int4_wo_32
int4_wo_64
int4_wo_128
uint4_wo_32
uint4_wo_64
uint4_wo_128
int8
mxfp4
mxfp6_e3m2
mxfp6_e2m3
mx6
bfp16
- The quantization algorithms supported by the template are:
awq
gptq
smoothquant
autosmoothquant
rotation
- The KV cache schemes supported by the template are:
fp8
- The attention schemes supported by the template are:
fp8
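The weight-only scheme names above encode their parameters in the name itself: given the `group_size` constructor argument of the INT4/UINT4 schemes, `int4_wo_64` reads as INT4 weight-only with group size 64. A hypothetical helper (not part of Quark's API) illustrating that naming convention:

```python
# Hypothetical helper, not part of Quark's API: splits a weight-only scheme
# name such as "int4_wo_64" into its data type and group size.

def parse_weight_only_scheme(name):
    dtype, tag, group = name.split("_")
    if tag != "wo":
        raise ValueError(f"Not a weight-only scheme name: {name!r}")
    return dtype, int(group)


print(parse_weight_only_scheme("int4_wo_64"))    # ('int4', 64)
print(parse_weight_only_scheme("uint4_wo_128"))  # ('uint4', 128)
```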
Creating a Custom Template:
To create a custom template for a new model type, you can define layer name patterns and algorithm configurations specific to your model architecture. Take moonshotai/Kimi-K2-Instruct as an example:
from quark.torch import LLMTemplate

# Create a new template
template = LLMTemplate(
    model_type="kimi_k2",
    kv_layers_name=["*kv_b_proj"],
    exclude_layers_name=["lm_head"]
)

# Register the template to the LLMTemplate class
# (optional, if you want to use the template in other places)
LLMTemplate.register_template(template)
- classmethod list_available() → list[str][source]#
List all available model names of registered templates.
- Returns:
List of template names.
- Return type:
List[str]
Example:
from quark.torch import LLMTemplate

templates = LLMTemplate.list_available()
print(templates)  # ['llama', 'opt', 'gpt2', ...]
- classmethod register_template(template: LLMTemplate) → None[source]#
Register a template.
- Parameters:
template (LLMTemplate) – The template to register.
Example:
from quark.torch import LLMTemplate

# Create template
template = LLMTemplate(
    model_type="llama",
    kv_layers_name=["*k_proj", "*v_proj"],
    q_layer_name="*q_proj"
)

# Register template
LLMTemplate.register_template(template)
- classmethod get(model_type: str) → LLMTemplate[source]#
Get a template by model type.
- Parameters:
model_type (str) – Type of the model. It is obtained from the original HuggingFace LLM model's model.config.model_type attribute. When the model_type field is not defined, model.config.architectures[0] is assigned as the model_type.
Available model types:
llama
mllama
llama4
opt
qwen2_moe
qwen2
qwen
chatglm
phi3
phi
mistral
mixtral
gptj
grok-1
cohere
dbrx
deepseek_v2
deepseek_v3
deepseek
olmo
gemma2
gemma3_text
gemma3
instella
gpt_oss
- Returns:
The template object.
- Return type:
LLMTemplate
Example:
from quark.torch import LLMTemplate

template = LLMTemplate.get("llama")
print(template)
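The lookup key follows the HuggingFace convention described above. A minimal sketch of that resolution logic, assuming a config object with the standard HuggingFace `model_type` and `architectures` attributes (the helper and `DummyConfig` are illustrative, not part of Quark's API):

```python
# Sketch of deriving the template lookup key from a HuggingFace-style config.
# Illustrative only; the helper is not part of Quark's API.

def infer_model_type(config):
    model_type = getattr(config, "model_type", None)
    if not model_type:
        # Fall back to the first architecture name when model_type is unset.
        model_type = config.architectures[0]
    return model_type


class DummyConfig:
    model_type = "llama"
    architectures = ["LlamaForCausalLM"]


print(infer_model_type(DummyConfig()))  # llama
```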
- classmethod register_scheme(scheme_name: str, config: QuantizationConfig) → None[source]#
Register a new quantization scheme for LLMTemplate class.
- Parameters:
scheme_name (str) – Name of the scheme.
config (QuantizationConfig) – Configuration for the scheme.
Example:
# Register a new quantization scheme ``int8_wo`` (int8 weight-only) to the template
from quark.torch import LLMTemplate
from quark.torch.quantization.config.config import Int8PerTensorSpec, QuantizationConfig

quant_spec = Int8PerTensorSpec(
    observer_method="min_max",
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    is_dynamic=False
).to_quantization_spec()

global_config = QuantizationConfig(weight=quant_spec)
LLMTemplate.register_scheme("int8_wo", config=global_config)
- classmethod unregister_scheme(scheme_name: str) → None[source]#
Unregister a quantization scheme.
- Parameters:
scheme_name (str) – Name of the scheme to unregister.
Example:
from quark.torch import LLMTemplate

LLMTemplate.unregister_scheme("int8")
- get_config(scheme: str, algorithm: str | list[str] | None = None, kv_cache_scheme: str | None = None, min_kv_scale: float = 0.0, attention_scheme: str | None = None, layer_config: dict[str, str] | None = None, layer_type_config: dict[type[Module], str] | None = None, exclude_layers: list[str] | None = None) → Config[source]#
Create a quantization configuration based on the provided parameters.
- Parameters:
scheme (str) – Name of the quantization scheme.
algorithm (Optional[Union[str, List[str]]]) – Name or list of names of quantization algorithms to apply.
kv_cache_scheme (Optional[str]) – Name of the KV cache quantization scheme.
min_kv_scale (float) – Minimum value of KV Cache scale.
attention_scheme (Optional[str]) – Name of the attention quantization scheme.
layer_config (Optional[Dict[str, str]]) – Dictionary of layer name patterns and quantization scheme names.
layer_type_config (Optional[Dict[Type[nn.Module], str]]) – Dictionary of layer types and quantization scheme names.
exclude_layers (Optional[List[str]]) – List of layer names to exclude from quantization.
Example:
from quark.torch import LLMTemplate

template = LLMTemplate.get("llama")
config = template.get_config(scheme="fp8", kv_cache_scheme="fp8")
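The layer_config and exclude_layers parameters match layers by name pattern. The standalone sketch below illustrates how such wildcard-based resolution typically works, using fnmatch-style patterns; it is an assumption about the matching semantics for illustration, not Quark's implementation, and `resolve_layer_scheme` is a hypothetical helper:

```python
# Illustrative pattern-based per-layer scheme resolution, suggested by the
# layer_config / exclude_layers parameters above. NOT Quark's implementation;
# the fnmatch-style wildcard semantics are an assumption.
import fnmatch


def resolve_layer_scheme(layer_name, global_scheme,
                         layer_config=None, exclude_layers=None):
    # Excluded layers receive no quantization scheme at all.
    for pattern in (exclude_layers or []):
        if fnmatch.fnmatch(layer_name, pattern):
            return None
    # Per-layer overrides take precedence over the global scheme.
    for pattern, scheme in (layer_config or {}).items():
        if fnmatch.fnmatch(layer_name, pattern):
            return scheme
    return global_scheme


print(resolve_layer_scheme("model.layers.0.mlp.down_proj", "fp8",
                           layer_config={"*down_proj": "int8"},
                           exclude_layers=["lm_head"]))  # int8
print(resolve_layer_scheme("lm_head", "fp8",
                           exclude_layers=["lm_head"]))  # None
```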