Configuring PyTorch Quantization for Large Language Models#

AMD Quark for PyTorch provides a convenient way to configure quantization for Large Language Models (LLMs) through the LLMTemplate class. This approach simplifies the configuration process by providing pre-defined settings for popular LLM architectures.

Using LLMTemplate for Quantization Configuration#

Supported Quantization Schemes#

The following table lists the quantization schemes supported by LLMTemplate, their configuration details, and the platforms on which they are supported:

| Scheme | Configuration Details | Platforms Supported |
|---|---|---|
| int4_wo_128 | Weight-only symmetric INT4; per-group quantization; group size 128 | RyzenAI, ZenDNN |
| int4_wo_64 | Weight-only symmetric INT4; per-group quantization; group size 64 | RyzenAI, ZenDNN |
| int4_wo_32 | Weight-only symmetric INT4; per-group quantization; group size 32 | RyzenAI, ZenDNN |
| uint4_wo_128 | Weight-only asymmetric UINT4; per-group quantization; group size 128 | RyzenAI, ZenDNN |
| uint4_wo_64 | Weight-only asymmetric UINT4; per-group quantization; group size 64 | RyzenAI, ZenDNN |
| uint4_wo_32 | Weight-only asymmetric UINT4; per-group quantization; group size 32 | RyzenAI, ZenDNN |
| int8 | INT8; per-tensor quantization; static quantization | RyzenAI, ZenDNN |
| fp8 | FP8 E4M3 format; per-tensor quantization; static quantization | AMD MI300, MI350, MI355 GPUs |
| mxfp4 | OCP MXFP4 format; per-group quantization; group size 32; dynamic quantization | AMD MI350, MI355 GPUs |
| mxfp6_e2m3 | OCP MXFP6E2M3 format; per-group quantization; group size 32; dynamic quantization | AMD MI350, MI355 GPUs |
| mxfp6_e3m2 | OCP MXFP6E3M2 format; per-group quantization; group size 32; dynamic quantization | AMD MI350, MI355 GPUs |
| mx6 | MX6 format; per-group quantization; group size 32; dynamic quantization | TBD |
| bfp16 | BFP16 format; per-group quantization; group size 8; dynamic quantization | TBD |

The LLMTemplate class offers several methods to create and customize quantization configurations:

1. Using Built-in Templates#

AMD Quark includes built-in templates for popular LLM architectures. You can get a list of available templates and use them directly:

from quark.torch import LLMTemplate

# List available templates
templates = LLMTemplate.list_available()
print(templates)  # ['llama', 'opt', 'qwen', 'mistral', ...]

# Get a specific template
llama_template = LLMTemplate.get("llama")

# Create a basic configuration
config = llama_template.get_config(scheme="fp8", kv_cache_scheme="fp8")

Note

In the function get(), the model_type parameter corresponds to the model.config.model_type attribute. For example, for the model facebook/opt-125m, the model_type is opt (see the model's config.json). When the model_type field is not defined, model.config.architectures[0] is used as the model_type instead.
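For illustration, the following sketch shows how the template name can be derived from a Hugging Face model configuration following the rule above. AutoConfig, model_type, and architectures are standard transformers attributes; the fallback logic simply mirrors the note and is not part of the Quark API.

from transformers import AutoConfig

from quark.torch import LLMTemplate

# Resolve the template name the way the note describes (illustrative sketch).
model_config = AutoConfig.from_pretrained("facebook/opt-125m")
model_type = getattr(model_config, "model_type", None) or model_config.architectures[0]
print(model_type)  # 'opt'

# Use the resolved name to fetch the matching template.
template = LLMTemplate.get(model_type)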

2. Creating Configurations with Advanced Options#

The template system supports various quantization options, including quantization algorithms, KV cache and attention schemes, layer-wise and layer-type-wise quantization, and excluded layers (exclude_layers).

import torch.nn as nn
from quark.torch import LLMTemplate

# Get a specific template
llama_template = LLMTemplate.get("llama")

# Create configuration with multiple options
config = llama_template.get_config(
    scheme="int4_wo_128",          # Global quantization scheme
    algorithm="awq",               # Quantization algorithm
    kv_cache_scheme="fp8",         # KV cache quantization
    min_kv_scale=1.0,              # Minimum value of KV Cache scale
    attention_scheme="fp8",        # Attention quantization
    layer_config={                 # Layer-specific configurations
        "*.mlp.gate_proj": "mxfp4",
        "*.mlp.up_proj": "mxfp4",
        "*.mlp.down_proj": "mxfp4"
    },
    layer_type_config={            # Layer type configurations
        nn.LayerNorm: "fp8"
    },
    exclude_layers=["lm_head"]      # Exclude layers from quantization
)

Notes:

  • KV cache quantization currently supports only fp8.

  • The minimum value of the KV cache scale is 1.0.

  • Attention quantization currently supports only fp8.

  • Supported algorithms are awq, gptq, smoothquant, autosmoothquant, and rotation.

  • Layer-wise and layer-type-wise configurations support all of the quantization schemes.

  • Layer-wise and layer-type-wise configurations can override the global scheme; the wildcard patterns they use are illustrated in the sketch below.
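The keys in layer_config are wildcard patterns matched against fully qualified module names. The snippet below is only an illustration of the intended glob-style matching, using Python's fnmatch as a stand-in; the exact matching rules are defined by AMD Quark, and the module names are typical Llama-style examples.

from fnmatch import fnmatch

# Illustrative only: glob-style matching of a layer_config pattern against
# fully qualified module names.
module_names = [
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.self_attn.q_proj",
]
matches = [name for name in module_names if fnmatch(name, "*.mlp.gate_proj")]
print(matches)  # ['model.layers.0.mlp.gate_proj']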

3. Creating New Templates#

You can create a template for a new model by instantiating LLMTemplate with the model-specific settings and then building its quantization configuration. Take moonshotai/Kimi-K2-Instruct as an example:

from quark.torch import LLMTemplate

# Create a new template
template = LLMTemplate(
    model_type="kimi_k2",
    kv_layer_name=["*kv_b_proj"],
    exclude_layers=["lm_head"]
)

# Register the template with the LLMTemplate class (optional; only needed if
# you want to retrieve it elsewhere via LLMTemplate.get)
LLMTemplate.register_template(template)

# Get the template
template = LLMTemplate.get("kimi_k2")

# Create a configuration
config = template.get_config(
    scheme="fp8",
    kv_cache_scheme="fp8"
)

4. Registering Custom Schemes#

You can register custom quantization schemes for use with templates:

from quark.torch.quantization.config.config import Int8PerTensorSpec, QuantizationConfig
from quark.torch import LLMTemplate

# Create custom quantization specification
quant_spec = Int8PerTensorSpec(
    observer_method="min_max",
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    is_dynamic=False
).to_quantization_spec()

# Create and register custom scheme
global_config = QuantizationConfig(weight=quant_spec)
LLMTemplate.register_scheme("custom_int8_wo", config=global_config)

# Get a specific template
llama_template = LLMTemplate.get("llama")

# Use custom scheme
config = llama_template.get_config(scheme="custom_int8_wo")

The template-based configuration system provides a streamlined way to set up quantization for LLMs while maintaining flexibility for customization. It handles common patterns and configurations automatically while allowing for specific adjustments when needed.
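As a closing sketch, the snippet below shows how a template-generated configuration would typically be used in the standard AMD Quark for PyTorch workflow, in which a configuration is passed to ModelQuantizer together with a calibration dataloader. The model name and calibration data are placeholders, and it is assumed that the object returned by get_config is the regular Quark Config consumed by ModelQuantizer.

from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

from quark.torch import LLMTemplate, ModelQuantizer

# Load the model to quantize (placeholder model name).
model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build the quantization configuration from the matching template.
template = LLMTemplate.get(model.config.model_type)  # 'llama'
config = template.get_config(scheme="fp8", kv_cache_scheme="fp8")

# Static schemes such as fp8 need calibration data (placeholder data).
calib_ids = tokenizer(["Hello, world."], return_tensors="pt")["input_ids"]
calib_dataloader = DataLoader(calib_ids, batch_size=1)

# Quantize the model with the generated configuration.
quantizer = ModelQuantizer(config)
quantized_model = quantizer.quantize_model(model, calib_dataloader)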