Getting started with AMD Quark#
AMD Quark provides a streamlined approach to quantizing models in both PyTorch and ONNX formats, enabling efficient deployment across various hardware platforms.
Users first choose which flow to use for quantizing their model. In general, the PyTorch flow is recommended for large language models (LLMs), and the ONNX flow for everything else. The Ryzen AI NPU is supported only by the ONNX flow, while the PyTorch flow supports ROCm and CUDA accelerators.
Typically, quantizing a floating-point model with AMD Quark involves the following steps:
1. Load the original floating-point model.
2. Define the data loader for calibration (optional).
3. Set the quantization configuration.
4. Use the AMD Quark API to perform an in-place replacement of the model's modules with quantized modules.
5. (Optional; PyTorch flow only) Export the quantized model to other formats for deployment, such as ONNX or Hugging Face safetensors.
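To make steps 2-4 concrete, the sketch below implements MinMax calibration and static, symmetric, per-tensor int8 quantization in plain Python. This is only the underlying arithmetic that a quantizer applies when it replaces float modules with quantized ones; the function names here are illustrative and are not part of the Quark API.

```python
# Illustrative sketch of static, symmetric, per-tensor int8 quantization.
# These helpers are NOT the Quark API -- they only show the arithmetic.

def calibrate_minmax(samples):
    """MinMax calibration: find the largest magnitude over all batches."""
    return max(abs(v) for batch in samples for v in batch)

def quantize(values, scale, qmin=-128, qmax=127):
    """Map floats to int8 codes: q = clamp(round(x / scale))."""
    return [min(max(round(v / scale), qmin), qmax) for v in values]

def dequantize(codes, scale):
    """Recover approximate floats: x ~= q * scale."""
    return [q * scale for q in codes]

# Calibration data stands in for the optional data loader (step 2).
calib = [[0.1, -0.4], [2.0, -1.5]]
scale = calibrate_minmax(calib) / 127   # symmetric: zero-point is 0

codes = quantize([0.5, -2.0, 1.0], scale)
approx = dequantize(codes, scale)
```

Dynamic quantization differs only in when the scale is computed: instead of fixing it from calibration data ahead of time, the min/max statistics are gathered at runtime from each activation tensor.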
Comparing Quark’s ONNX and PyTorch capabilities#
Each Quark workflow (PyTorch and ONNX) possesses its own set of features, data type support, and characteristics, catering to different model architectures and deployment scenarios. Understanding these nuances is crucial for optimal quantization results.
| Feature | Quark for PyTorch | Quark for ONNX |
|---|---|---|
| Data Type | Float16 / Bfloat16 / Int4 / Uint4 / Int8 / OCP_FP8_E4M3 / OCP_MXFP8_E4M3 / OCP_MXFP6 / OCP_MXFP4 / OCP_MXINT8 | Int8 / Uint8 / Int16 / Uint16 / Int32 / Uint32 / Float16 / Bfloat16 |
| Quant Mode | Eager Mode / FX Graph Mode | ONNX Graph Mode |
| Quant Strategy | Static quant / Dynamic quant / Weight-only | Static quant / Dynamic quant / Weight-only |
| Quant Scheme | Per tensor / Per channel / Per group | Per tensor / Per channel |
| Symmetry | Symmetric / Asymmetric | Symmetric / Asymmetric |
| Calibration Method | MinMax / Percentile / MSE | MinMax / Percentile / MinMSE / Entropy / NonOverflow |
| Scale Type | Float32 / Float16 | Float32 / Float16 |
| KV-Cache Quant | FP8 KV-Cache Quant | N/A |
| Supported Ops | nn.Linear / nn.Conv2d / nn.ConvTranspose2d / nn.Embedding / nn.EmbeddingBag | Most ONNX ops (see the full list in the documentation) |
| Pre-Quant Optimization | SmoothQuant | QuaRot / SmoothQuant (single GPU/CPU) / CLE / Bias Correction |
| Quantization Algorithm | AWQ / GPTQ | AdaQuant / AdaRound / GPTQ |
| Export Format | ONNX / JSON-safetensors / GGUF (Q4_1) | N/A |
| Operating Systems | Linux (ROCm/CUDA) / Windows (CPU) | Linux (ROCm/CUDA) / Windows (CPU) |
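The "Quant Scheme" row is worth unpacking: per-tensor quantization uses one scale for an entire weight tensor, while per-channel quantization computes a separate scale per output channel, which protects small-magnitude channels from being crushed by a single large outlier channel. The plain-Python sketch below contrasts the two scale computations; the helper names are illustrative, not Quark's API.

```python
# Illustrative contrast of per-tensor vs. per-channel scale computation
# (symmetric int8, qmax = 127). Not the Quark API -- concept only.

def pertensor_scale(weight, qmax=127):
    """One scale for the whole 2-D weight matrix."""
    return max(abs(v) for row in weight for v in row) / qmax

def perchannel_scales(weight, qmax=127):
    """One scale per output channel (row) of the weight matrix."""
    return [max(abs(v) for v in row) / qmax for row in weight]

weight = [[0.01, -0.02],   # small-magnitude channel
          [4.0,  -3.5]]    # large-magnitude channel

s_tensor = pertensor_scale(weight)     # single coarse scale: 4.0 / 127
s_channel = perchannel_scales(weight)  # [0.02 / 127, 4.0 / 127]
```

With the shared per-tensor scale, every value in the first row would quantize to 0; the per-channel scale keeps that row's resolution. Per-group quantization (PyTorch flow only) refines this further by splitting each channel into fixed-size groups with their own scales.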