Getting started with AMD Quark

AMD Quark provides a streamlined approach to quantizing models in both PyTorch and ONNX formats, enabling efficient deployment across various hardware platforms.

First, choose which flow to use for quantizing your model. Generally speaking, the PyTorch flow is recommended for large language models (LLMs); the ONNX flow is recommended otherwise. The Ryzen AI NPU is supported only by the ONNX flow, while the PyTorch flow supports ROCm and CUDA accelerators.

Typically, quantizing a floating-point model with AMD Quark involves the following steps (a minimal PyTorch sketch follows the list):

  1. Load the original floating-point model.

  2. Define the data loader for calibration (optional).

  3. Set the quantization configuration.

  4. Use the AMD Quark API to perform an in-place replacement of the model’s modules with quantized modules.

  5. (Optional, only supported for PyTorch flow) Export the quantized model to other formats for deployment, such as ONNX, Hugging Face safetensors, etc.
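
The sketch below walks through steps 1 to 4 for the PyTorch flow. The import paths, config classes, and `quantize_model` signature follow the `quark.torch` examples, but treat them as assumptions and verify them against the API reference for your Quark release; the model name and calibration sentences are placeholders.

```python
# A minimal sketch of the PyTorch flow (names assumed from quark.torch examples).
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

from quark.torch import ModelQuantizer
from quark.torch.quantization.config.config import (
    Config,
    QuantizationConfig,
    QuantizationSpec,
)
from quark.torch.quantization.config.type import Dtype, QSchemeType, RoundType, ScaleType
from quark.torch.quantization.observer.observer import PerTensorMinMaxObserver

# 1. Load the original floating-point model.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# 2. Define a data loader for calibration. Real workloads should use
#    representative data rather than these toy sentences.
texts = ["Quantization reduces memory footprint.", "AMD Quark has two flows."]
calib_dataloader = DataLoader(
    [tokenizer(t, return_tensors="pt").input_ids for t in texts], batch_size=None
)

# 3. Set the quantization configuration: int8, symmetric, per-tensor weights.
int8_spec = QuantizationSpec(
    dtype=Dtype.int8,
    qscheme=QSchemeType.per_tensor,
    observer_cls=PerTensorMinMaxObserver,
    symmetric=True,
    scale_type=ScaleType.float,
    round_method=RoundType.half_even,
    is_dynamic=False,
)
quant_config = Config(global_quant_config=QuantizationConfig(weight=int8_spec))

# 4. Quantize: Quark replaces the model's modules with quantized ones in place.
quantizer = ModelQuantizer(quant_config)
quant_model = quantizer.quantize_model(model, calib_dataloader)

# 5. (Optional) Export the quantized model, e.g. to ONNX or safetensors --
#    see the exporting guide for the supported formats and APIs.
```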

Comparing Quark’s ONNX and PyTorch capabilities

Each Quark workflow (PyTorch and ONNX) has its own set of features, data type support, and characteristics, catering to different model architectures and deployment scenarios. Understanding these differences is important for getting good quantization results.

| Feature Name | Quark for PyTorch | Quark for ONNX |
|---|---|---|
| Data Type | Float16 / Bfloat16 / Int4 / Uint4 / Int8 / OCP_FP8_E4M3 / OCP_MXFP8_E4M3 / OCP_MXFP6 / OCP_MXFP4 / OCP_MXINT8 | Int8 / Uint8 / Int16 / Uint16 / Int32 / Uint32 / Float16 / Bfloat16 |
| Quant Mode | Eager Mode / FX Graph Mode | ONNX Graph Mode |
| Quant Strategy | Static quant / Dynamic quant / Weight-only | Static quant / Dynamic quant / Weight-only |
| Quant Scheme | Per tensor / Per channel / Per group | Per tensor / Per channel |
| Symmetry | Symmetric / Asymmetric | Symmetric / Asymmetric |
| Calibration Method | MinMax / Percentile / MSE | MinMax / Percentile / MinMSE / Entropy / NonOverflow |
| Scale Type | Float32 / Float16 | Float32 / Float16 |
| KV-Cache Quant | FP8 KV-Cache Quant | N/A |
| Supported Ops | nn.Linear / nn.Conv2d / nn.ConvTranspose2d / nn.Embedding / nn.EmbeddingBag | Most ONNX ops (see the full list in the documentation) |
| Pre-Quant Optimization | SmoothQuant | QuaRot / SmoothQuant (single GPU/CPU) / CLE / Bias Correction |
| Quantization Algorithm | AWQ / GPTQ | AdaQuant / AdaRound / GPTQ |
| Export Format | ONNX / JSON-safetensors / GGUF (Q4_1) | N/A |
| Operating Systems | Linux (ROCm/CUDA) / Windows (CPU) | Linux (ROCm/CUDA) / Windows (CPU) |
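
In contrast to the PyTorch flow, the ONNX flow quantizes a serialized model file rather than live modules. The sketch below assumes the `quark.onnx` API with one of its documented built-in configurations ("XINT8"); the config name, import paths, and `quantize_model` signature should be checked against your Quark release, and the model paths and input shape are placeholders.

```python
# A minimal sketch of the ONNX flow (names assumed from quark.onnx examples).
import numpy as np
from onnxruntime.quantization import CalibrationDataReader

from quark.onnx import ModelQuantizer
from quark.onnx.quantization.config import Config, get_default_config


class RandomCalibrationDataReader(CalibrationDataReader):
    """Feeds a few random batches for calibration; use real data in practice."""

    def __init__(self, input_name: str, shape, num_batches: int = 8):
        self.batches = iter(
            [{input_name: np.random.rand(*shape).astype(np.float32)}
             for _ in range(num_batches)]
        )

    def get_next(self):
        # Return None when exhausted, as the CalibrationDataReader API expects.
        return next(self.batches, None)


# Pick a built-in int8 configuration ("XINT8" is one of the documented defaults).
quant_config = Config(global_quant_config=get_default_config("XINT8"))

# Quantize the serialized model: read the float32 file, write the int8 file.
quantizer = ModelQuantizer(quant_config)
quantizer.quantize_model(
    "model_fp32.onnx",   # path to the floating-point input model (placeholder)
    "model_int8.onnx",   # path for the quantized output model (placeholder)
    RandomCalibrationDataReader("input", (1, 3, 224, 224)),
)
```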

Next steps