Configuring PyTorch Quantization for Large Language Models#
AMD Quark for PyTorch provides a convenient way to configure quantization for Large Language Models (LLMs) through the LLMTemplate
class. This approach simplifies the configuration process by providing pre-defined settings for popular LLM architectures.
Using LLMTemplate for Quantization Configuration#
Supported Quantization Schemes#
The following table lists the quantization schemes supported by LLMTemplate, their detailed configurations, and the platforms on which they are supported:
| Scheme | Configuration Details | Platforms Supported |
|---|---|---|
| int4_wo_128 | | |
| int4_wo_64 | | |
| int4_wo_32 | | |
| uint4_wo_128 | | |
| uint4_wo_64 | | |
| uint4_wo_32 | | |
| int8 | | |
| fp8 | | |
| mxfp4 | | |
| mxfp6_e2m3 | | |
| mxfp6_e3m2 | | |
| mx6 | | TBD |
| bfp16 | | TBD |
The LLMTemplate
class offers several methods to create and customize quantization configurations:
1. Using Built-in Templates#
AMD Quark includes built-in templates for popular LLM architectures. You can get a list of available templates and use them directly:
from quark.torch import LLMTemplate
# List available templates
templates = LLMTemplate.list_available()
print(templates) # ['llama', 'opt', 'qwen', 'mistral', ...]
# Get a specific template
llama_template = LLMTemplate.get("llama")
# Create a basic configuration
config = llama_template.get_config(scheme="fp8", kv_cache_scheme="fp8")
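The returned configuration can then be passed to Quark's quantizer. The following is a minimal sketch, assuming get_config() returns a configuration object accepted by quark.torch.ModelQuantizer and that a Hugging Face model and a calibration dataloader (calib_dataloader) have already been prepared; the checkpoint name is only an example:
import torch
from transformers import AutoModelForCausalLM
from quark.torch import LLMTemplate, ModelQuantizer
# Load the model to be quantized (any Llama-family checkpoint works here)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)
# Build the configuration from the template as shown above
llama_template = LLMTemplate.get("llama")
config = llama_template.get_config(scheme="fp8", kv_cache_scheme="fp8")
# Calibrate and quantize; calib_dataloader is assumed to be a DataLoader of
# tokenized calibration samples prepared beforehand
quantizer = ModelQuantizer(config)
quant_model = quantizer.quantize_model(model, calib_dataloader)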
Note
The model_type argument passed to get() corresponds to the model.config.model_type attribute. For example, for the model facebook/opt-125m, the model_type is opt (see the model's config.json). When the model_type field is not defined, model.config.architectures[0] is used as the model_type instead.
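For reference, here is a quick way to check a model's model_type before picking a template, using Hugging Face's AutoConfig; the fallback line mirrors the behavior described in the note above:
from transformers import AutoConfig
from quark.torch import LLMTemplate
# Inspect the model type from the Hugging Face config
hf_config = AutoConfig.from_pretrained("facebook/opt-125m")
model_type = getattr(hf_config, "model_type", None) or hf_config.architectures[0]
print(model_type)  # 'opt'
# Use it to look up the matching template
template = LLMTemplate.get(model_type)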
2. Creating Configurations with Advanced Options#
The template system supports various quantization options, including quantization algorithms, KV cache and attention schemes, layer-wise and layer-type-wise quantization, and excluded layers (exclude_layers):
import torch.nn as nn
from quark.torch import LLMTemplate
# Get a specific template
llama_template = LLMTemplate.get("llama")
# Create configuration with multiple options
config = llama_template.get_config(
    scheme="int4_wo_128",        # Global quantization scheme
    algorithm="awq",             # Quantization algorithm
    kv_cache_scheme="fp8",       # KV cache quantization
    min_kv_scale=1.0,            # Minimum value of the KV cache scale
    attention_scheme="fp8",      # Attention quantization
    layer_config={               # Layer-specific configurations
        "*.mlp.gate_proj": "mxfp4",
        "*.mlp.up_proj": "mxfp4",
        "*.mlp.down_proj": "mxfp4"
    },
    layer_type_config={          # Layer-type configurations
        nn.LayerNorm: "fp8"
    },
    exclude_layers=["lm_head"]   # Exclude layers from quantization
)
Notes:
- KV cache quantization currently supports only fp8.
- The minimum value of the KV cache scale is 1.0.
- Attention quantization currently supports only fp8.
- Supported algorithms are awq, gptq, smoothquant, autosmoothquant, and rotation.
- Layer-wise and layer-type-wise configurations support all quantization schemes; the wildcard patterns select layers by module name (see the sketch below).
- Layer-wise and layer-type-wise configurations can override the global scheme.
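As a purely illustrative sketch of the wildcard patterns used by layer_config and exclude_layers, the snippet below uses Python's fnmatch as a stand-in for whatever matching Quark performs internally; it only shows which module names a pattern such as "*.mlp.gate_proj" would pick out:
import fnmatch
# Example module names as they would appear in a Llama-style model
layer_names = [
    "model.layers.0.mlp.gate_proj",
    "model.layers.0.mlp.up_proj",
    "model.layers.0.self_attn.q_proj",
    "lm_head",
]
pattern = "*.mlp.gate_proj"
matched = [name for name in layer_names if fnmatch.fnmatch(name, pattern)]
print(matched)  # ['model.layers.0.mlp.gate_proj']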
3. Creating New Templates#
You can create a template for a new model by instantiating LLMTemplate with the model's settings and then building its quantization configuration. Take moonshotai/Kimi-K2-Instruct as an example:
from quark.torch import LLMTemplate
# Create a new template
template = LLMTemplate(
    model_type="kimi_k2",
    kv_layer_name=["*kv_b_proj"],
    exclude_layers=["lm_head"]
)
# Register the template with the LLMTemplate class (optional; only needed if you
# want to retrieve the template by name elsewhere)
LLMTemplate.register_template(template)
# Get the template
template = LLMTemplate.get("kimi_k2")
# Create a configuration
config = template.get_config(
    scheme="fp8",
    kv_cache_scheme="fp8"
)
4. Registering Custom Schemes#
You can register custom quantization schemes for use with templates:
from quark.torch.quantization.config.config import Int8PerTensorSpec, QuantizationConfig
from quark.torch import LLMTemplate
# Create custom quantization specification
quant_spec = Int8PerTensorSpec(
    observer_method="min_max",
    symmetric=True,
    scale_type="float",
    round_method="half_even",
    is_dynamic=False
).to_quantization_spec()
# Create and register custom scheme
global_config = QuantizationConfig(weight=quant_spec)
LLMTemplate.register_scheme("custom_int8_wo", config=global_config)
# Get a specific template
llama_template = LLMTemplate.get("llama")
# Use custom scheme
config = llama_template.get_config(scheme="custom_int8_wo")
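Once registered, the custom scheme is referenced by name like any built-in scheme and can be combined with the other get_config options shown earlier. A short sketch, assuming the registration above applies to all templates in the current process:
# Reuse the custom scheme with a different template and the usual options
opt_template = LLMTemplate.get("opt")
config = opt_template.get_config(
    scheme="custom_int8_wo",      # custom scheme registered above
    kv_cache_scheme="fp8",        # KV cache quantization (fp8 only)
    exclude_layers=["lm_head"]    # skip the output projection
)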
The template-based configuration system provides a streamlined way to set up quantization for LLMs while maintaining flexibility for customization. It handles common patterns and configurations automatically while allowing for specific adjustments when needed.