Getting started with AMD Quark

AMD Quark provides a streamlined approach to quantizing models in both PyTorch and ONNX formats, enabling efficient deployment across various hardware platforms.

First, choose which flow to use for quantizing your model. Generally speaking, the PyTorch flow is recommended for large language models (LLMs); the ONNX flow is recommended otherwise. The Ryzen AI NPU is supported only by the ONNX flow, while the PyTorch flow supports ROCm and CUDA accelerators.

Typically, quantizing a floating-point model with AMD Quark involves the following steps (a minimal PyTorch sketch follows the list):

  1. Load the original floating-point model.

  2. Define the data loader for calibration (optional).

  3. Set the quantization configuration.

  4. Use the AMD Quark API to perform an in-place replacement of the model’s modules with quantized modules.

  5. (Optional, only supported for PyTorch flow) Export the quantized model to other formats for deployment, such as ONNX, Hugging Face safetensors, etc.
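
The sketch below walks through steps 1 to 4 for the PyTorch flow. The import paths, config classes, and `quantize_model` signature follow the `quark.torch` examples, but treat them as assumptions and verify them against the API reference for your Quark release; the model name and calibration sentences are placeholders.

```python
# A minimal sketch of the PyTorch flow (names assumed from quark.torch examples).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

from quark.torch import ModelQuantizer
from quark.torch.quantization.config.config import (
    Config,
    QuantizationConfig,
    QuantizationSpec,
)
from quark.torch.quantization.config.type import Dtype, QSchemeType, RoundType, ScaleType
from quark.torch.quantization.observer.observer import PerTensorMinMaxObserver

# 1. Load the original floating-point model.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# 2. Define a data loader for calibration. Real workloads should use
#    representative data rather than these toy sentences.
texts = ["Quantization reduces memory footprint.", "AMD Quark has two flows."]
calib_dataloader = DataLoader(
    [tokenizer(t, return_tensors="pt").input_ids for t in texts], batch_size=None
)

# 3. Set the quantization configuration: int8, symmetric, per-tensor weights.
int8_spec = QuantizationSpec(
    dtype=Dtype.int8,
    qscheme=QSchemeType.per_tensor,
    observer_cls=PerTensorMinMaxObserver,
    symmetric=True,
    scale_type=ScaleType.float,
    round_method=RoundType.half_even,
    is_dynamic=False,
)
quant_config = Config(global_quant_config=QuantizationConfig(weight=int8_spec))

# 4. Quantize: Quark replaces the model's modules with quantized ones in place.
quantizer = ModelQuantizer(quant_config)
quant_model = quantizer.quantize_model(model, calib_dataloader)

# 5. (Optional) Export the quantized model, e.g. to ONNX or safetensors --
#    see the exporting guide for the supported formats and APIs.
```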

Comparing Quark’s ONNX and PyTorch capabilities

Each Quark workflow (PyTorch and ONNX) has its own set of features, data type support, and characteristics, catering to different model architectures and deployment scenarios. Understanding these differences is important for getting good quantization results.

| Feature Name | Quark for PyTorch | Quark for ONNX |
|---|---|---|
| Data Type | Float16 / Bfloat16 / Int4 / Uint4 / Int8 / OCP_FP8_E4M3 / OCP_MXFP8_E4M3 / OCP_MXFP6 / OCP_MXFP4 / OCP_MXINT8 | Int8 / Uint8 / Int16 / Uint16 / Int32 / Uint32 / Float16 / Bfloat16 |
| Quant Mode | Eager Mode / FX Graph Mode | ONNX Graph Mode |
| Quant Strategy | Static quant / Dynamic quant / Weight-only | Static quant / Dynamic quant / Weight-only |
| Quant Scheme | Per tensor / Per channel / Per group | Per tensor / Per channel |
| Symmetry | Symmetric / Asymmetric | Symmetric / Asymmetric |
| Calibration Method | MinMax / Percentile / MSE | MinMax / Percentile / MinMSE / Entropy / NonOverflow |
| Scale Type | Float32 / Float16 | Float32 / Float16 |
| KV-Cache Quant | FP8 KV-Cache Quant | N/A |
| Supported Ops | nn.Linear / nn.Conv2d / nn.ConvTranspose2d / nn.Embedding / nn.EmbeddingBag | Most ONNX ops (see the full list in the documentation) |
| Pre-Quant Optimization | SmoothQuant | QuaRot / SmoothQuant (single GPU/CPU) / CLE / Bias Correction |
| Quantization Algorithm | AWQ / GPTQ | AdaQuant / AdaRound / GPTQ |
| Export Format | ONNX / JSON-safetensors / GGUF (Q4_1) | N/A |
| Operating Systems | Linux (ROCm/CUDA) / Windows (CPU) | Linux (ROCm/CUDA) / Windows (CPU) |
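
In contrast to the PyTorch flow, the ONNX flow quantizes a serialized model file rather than live modules. The sketch below assumes the `quark.onnx` API with one of its documented built-in configurations ("XINT8"); the config name, import paths, and `quantize_model` signature should be checked against your Quark release, and the model paths and input shape are placeholders.

```python
# A minimal sketch of the ONNX flow (names assumed from quark.onnx examples).
import numpy as np
from onnxruntime.quantization import CalibrationDataReader

from quark.onnx import ModelQuantizer
from quark.onnx.quantization.config import Config, get_default_config


class RandomCalibrationDataReader(CalibrationDataReader):
    """Feeds a few random batches for calibration; use real data in practice."""

    def __init__(self, input_name: str, shape, num_batches: int = 8):
        self.batches = iter(
            [{input_name: np.random.rand(*shape).astype(np.float32)}
             for _ in range(num_batches)]
        )

    def get_next(self):
        # Return None when exhausted, as the CalibrationDataReader API expects.
        return next(self.batches, None)


# Pick a built-in int8 configuration ("XINT8" is one of the documented defaults).
quant_config = Config(global_quant_config=get_default_config("XINT8"))

# Quantize the serialized model: read the float32 file, write the int8 file.
quantizer = ModelQuantizer(quant_config)
quantizer.quantize_model(
    "model_fp32.onnx",   # path to the floating-point input model (placeholder)
    "model_int8.onnx",   # path for the quantized output model (placeholder)
    RandomCalibrationDataReader("input", (1, 3, 224, 224)),
)
```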

Next steps