Configuring PyTorch Quantization for Large Language Models#

AMD Quark for PyTorch provides a convenient way to configure quantization for Large Language Models (LLMs) through the LLMTemplate class. This approach simplifies the configuration process by providing pre-defined settings for popular LLM architectures.

Using LLMTemplate for Quantization Configuration#

Supported Quantization Schemes#

The following table lists the quantization schemes supported by LLMTemplate, their configuration details, and the platforms on which they are supported:

| Scheme | Configuration Details | Platforms Supported |
|---|---|---|
| int4_wo_128 | Weight-only symmetric INT4; per-group quantization; group size 128 | RyzenAI, ZenDNN |
| int4_wo_64 | Weight-only symmetric INT4; per-group quantization; group size 64 | RyzenAI, ZenDNN |
| int4_wo_32 | Weight-only symmetric INT4; per-group quantization; group size 32 | RyzenAI, ZenDNN |
| uint4_wo_128 | Weight-only asymmetric UINT4; per-group quantization; group size 128 | RyzenAI, ZenDNN |
| uint4_wo_64 | Weight-only asymmetric UINT4; per-group quantization; group size 64 | RyzenAI, ZenDNN |
| uint4_wo_32 | Weight-only asymmetric UINT4; per-group quantization; group size 32 | RyzenAI, ZenDNN |
| int8 | INT8; per-tensor quantization; static quantization | RyzenAI, ZenDNN |
| fp8 | FP8 E4M3 format; per-tensor quantization; static quantization | AMD MI300, MI350, MI355 GPUs |
| mxfp4 | OCP MXFP4 format; per-group quantization; group size 32; dynamic quantization | AMD MI350, MI355 GPUs |
| mxfp6_e2m3 | OCP MXFP6E2M3 format; per-group quantization; group size 32; dynamic quantization | AMD MI350, MI355 GPUs |
| mxfp6_e3m2 | OCP MXFP6E3M2 format; per-group quantization; group size 32; dynamic quantization | AMD MI350, MI355 GPUs |
| mx6 | MX6 format; per-group quantization; group size 32; dynamic quantization | TBD |
| bfp16 | BFP16 format; per-group quantization; group size 8; dynamic quantization | TBD |

The LLMTemplate class offers several methods to create and customize quantization configurations:

1. Using Built-in Templates#

AMD Quark includes built-in templates for popular LLM architectures. You can get a list of available templates and use them directly:

from quark.torch import LLMTemplate

# List available templates
templates = LLMTemplate.list_available()
print(templates)  # ['llama', 'opt', 'qwen', 'mistral', ...]

# Get a specific template
llama_template = LLMTemplate.get("llama")

# Create a basic configuration
config = llama_template.get_config(scheme="fp8", kv_cache_scheme="fp8")

Note

In the function get(), the model_type parameter corresponds to the model.config.model_type attribute. For example, for the model facebook/opt-125m, the model_type is opt (see the model's config.json). When the model_type field is not defined, model.config.architectures[0] is used as the model_type instead.
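For illustration, the following sketch shows how the template name can be derived from a Hugging Face model configuration following the rule above. AutoConfig, model_type, and architectures are standard transformers attributes; the fallback logic simply mirrors the note and is not part of the Quark API.

from transformers import AutoConfig

from quark.torch import LLMTemplate

# Resolve the template name the way the note describes (illustrative sketch).
model_config = AutoConfig.from_pretrained("facebook/opt-125m")
model_type = getattr(model_config, "model_type", None) or model_config.architectures[0]
print(model_type)  # 'opt'

# Use the resolved name to fetch the matching template.
template = LLMTemplate.get(model_type)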

2. Creating Configurations with Advanced Options#

The template system supports various quantization options, including quantization algorithms, KV cache and attention schemes, layer-wise and layer-type-wise quantization, and excluded layers (exclude_layers).

import torch.nn as nn
from quark.torch import LLMTemplate

# Get a specific template
llama_template = LLMTemplate.get("llama")

# Create configuration with multiple options
config = llama_template.get_config(
    scheme="int4_wo_128",          # Global quantization scheme
    algorithm="awq",               # Quantization algorithm
    kv_cache_scheme="fp8",         # KV cache quantization
    min_kv_scale=1.0,              # Minimum value of KV Cache scale
    attention_scheme="fp8",        # Attention quantization
    layer_config={                 # Layer-specific configurations
        "*.mlp.gate_proj": "mxfp4",
        "*.mlp.up_proj": "mxfp4",
        "*.mlp.down_proj": "mxfp4"
    },
    layer_type_config={            # Layer type configurations
        nn.LayerNorm: "fp8"
    },
    exclude_layers=["lm_head"]      # Exclude layers from quantization
)

Notes:

  • KV cache quantization currently supports only fp8.

  • The minimum value of the KV cache scale is 1.0.

  • Attention quantization currently supports only fp8.

  • Supported algorithms are awq, gptq, smoothquant, autosmoothquant, and rotation.

  • Layer-wise and layer-type-wise configurations support all of the quantization schemes.

  • Layer-wise and layer-type-wise configurations can override the global scheme; the wildcard patterns they use are illustrated in the sketch below.
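The keys in layer_config are wildcard patterns matched against fully qualified module names. The snippet below is only an illustration of the intended glob-style matching, using Python's fnmatch as a stand-in; the exact matching rules are defined by AMD Quark, and the module names are typical Llama-style examples.

from fnmatch import fnmatch

# Illustrative only: glob-style matching of a layer_config pattern against
# fully qualified module names.
module_names = [
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.down_proj",
    "model.layers.0.self_attn.q_proj",
]
matches = [name for name in module_names if fnmatch(name, "*.mlp.gate_proj")]
print(matches)  # ['model.layers.0.mlp.gate_proj']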

3. Creating New Templates#

You can create a template for a new model by instantiating LLMTemplate with the model-specific settings and then building its quantization configuration. Take moonshotai/Kimi-K2-Instruct as an example:

from quark.torch import LLMTemplate

# Create a new template
template = LLMTemplate(
    model_type="kimi_k2",
    kv_layer_name=["*kv_b_proj"],
    exclude_layers=["lm_head"]
)

# Register the template with the LLMTemplate class (optional; only needed if
# you want to retrieve it elsewhere via LLMTemplate.get)
LLMTemplate.register_template(template)

# Get the template
template = LLMTemplate.get("kimi_k2")

# Create a configuration
config = template.get_config(
    scheme="fp8",
    kv_cache_scheme="fp8"
)

4. Registering Custom Schemes#

You can register custom quantization schemes for use with templates:

from quark.torch.quantization.config.config import Int8PerTensorSpec, QuantizationConfig
from quark.torch import LLMTemplate

# Create custom quantization specification
quant_spec = Int8PerTensorSpec(
    observer_method="min_max",
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    is_dynamic=False
).to_quantization_spec()

# Create and register custom scheme
global_config = QuantizationConfig(weight=quant_spec)
LLMTemplate.register_scheme("custom_int8_wo", config=global_config)

# Get a specific template
llama_template = LLMTemplate.get("llama")

# Use custom scheme
config = llama_template.get_config(scheme="custom_int8_wo")

The template-based configuration system provides a streamlined way to set up quantization for LLMs while maintaining flexibility for customization. It handles common patterns and configurations automatically while allowing for specific adjustments when needed.
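As a closing sketch, the snippet below shows how a template-generated configuration would typically be used in the standard AMD Quark for PyTorch workflow, in which a configuration is passed to ModelQuantizer together with a calibration dataloader. The model name and calibration data are placeholders, and it is assumed that the object returned by get_config is the regular Quark Config consumed by ModelQuantizer.

from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

from quark.torch import LLMTemplate, ModelQuantizer

# Load the model to quantize (placeholder model name).
model_id = "meta-llama/Llama-2-7b-hf"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Build the quantization configuration from the matching template.
template = LLMTemplate.get(model.config.model_type)  # 'llama'
config = template.get_config(scheme="fp8", kv_cache_scheme="fp8")

# Static schemes such as fp8 need calibration data (placeholder data).
calib_ids = tokenizer(["Hello, world."], return_tensors="pt")["input_ids"]
calib_dataloader = DataLoader(calib_ids, batch_size=1)

# Quantize the model with the generated configuration.
quantizer = ModelQuantizer(config)
quantized_model = quantizer.quantize_model(model, calib_dataloader)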