Mix Precision Auto-Search

The Mix Precision Auto-Search workflow automatically finds the optimal mixed-precision quantization configuration for large language models running on AMD Instinct GPUs via vLLM.

Given an accuracy loss budget (for example, GSM8K accuracy must not drop by more than 2%), it evaluates candidate configurations from least to most aggressive quantization and selects the most aggressively quantized one whose accuracy stays within the budget, with no manual parameter testing.
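As a rough illustration of the budget check, the sketch below treats the budget as an absolute drop in percentage points relative to the baseline score. That interpretation, the function name `within_budget`, and the numbers are assumptions made for this example; they are not taken from the tool.

```python
def within_budget(baseline_acc: float, candidate_acc: float, max_drop: float = 2.0) -> bool:
    """Return True if the candidate loses at most `max_drop` points vs. the baseline.

    Accuracies are expressed in percent (0-100). Treating the budget as an
    absolute drop in percentage points is an assumption of this sketch.
    """
    return (baseline_acc - candidate_acc) <= max_drop


# Made-up scores for illustration only, not measured results.
print(within_budget(90.0, 88.5))  # drop of 1.5 points
print(within_budget(90.0, 87.0))  # drop of 3.0 points
```

With a 2-point budget, the first candidate is accepted and the second is rejected.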

Supported Quantization Modes

Hardware	Supported Modes
MI300	`native`, `fp8`, `ptpc_fp8`
MI325	`native`, `fp8`, `ptpc_fp8`
MI355	`native`, `fp8`, `ptpc_fp8`, `mxfp4`, `mxfp4_fp8`, `mxfp6_e2m3`
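The table can be mirrored as a small lookup, which is handy for validating a requested mode before starting a long search. This is a sketch only: the names `SUPPORTED_MODES` and `modes_for` are hypothetical and are not part of the example script.

```python
# Hardware-to-mode table transcribed from the documentation above.
SUPPORTED_MODES = {
    "mi300": ("native", "fp8", "ptpc_fp8"),
    "mi325": ("native", "fp8", "ptpc_fp8"),
    "mi355": ("native", "fp8", "ptpc_fp8", "mxfp4", "mxfp4_fp8", "mxfp6_e2m3"),
}


def modes_for(hardware: str) -> tuple:
    """Return the quantization modes listed for a hardware target."""
    key = hardware.lower()
    if key not in SUPPORTED_MODES:
        raise ValueError(f"unknown hardware {hardware!r}; expected one of {sorted(SUPPORTED_MODES)}")
    return SUPPORTED_MODES[key]


print(modes_for("MI355"))
print("mxfp4" in modes_for("mi300"))
```

Per the table, the MXFP modes appear only for MI355, so the membership check for `mi300` is false.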

Workflow Overview

1. Load model on meta device: builds the layer graph for quantization config generation with no GPU memory cost.
2. Generate candidate configs: sorted from least to most aggressive quantization.
3. Evaluate baseline: run GSM8K on the original (unquantized) vLLM model.
4. Search loop: for each config in rank order, re-quantize, evaluate, check the accuracy threshold, and reset.
5. Select best config: the most aggressively quantized config that passes the threshold.
6. Export (optional): apply the best config and save as safetensors.
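The baseline, search, and selection steps can be sketched in a few lines of Python. This is a simplified illustration, not the script's actual implementation: `search_best_config`, the `evaluate` callable, and the config names are hypothetical, and the sketch assumes every candidate is evaluated rather than stopping at the first failure.

```python
from typing import Callable, Optional, Sequence


def search_best_config(
    candidates: Sequence[str],
    evaluate: Callable[[Optional[str]], float],
    max_drop: float,
) -> Optional[str]:
    """Return the most aggressive candidate whose accuracy stays within budget.

    `candidates` must already be sorted from least to most aggressive.
    `evaluate(None)` scores the unquantized baseline and `evaluate(cfg)` scores
    a quantized candidate; both stand in for a real GSM8K run.
    """
    baseline = evaluate(None)
    best = None
    for cfg in candidates:
        accuracy = evaluate(cfg)
        if baseline - accuracy <= max_drop:
            # Later candidates are more aggressive, so a later pass replaces an earlier one.
            best = cfg
    return best


# Made-up scores for illustration only, not measured results.
fake_scores = {None: 0.90, "cfg_a": 0.895, "cfg_b": 0.89, "cfg_c": 0.85}
best = search_best_config(["cfg_a", "cfg_b", "cfg_c"], lambda cfg: fake_scores[cfg], max_drop=0.02)
print(best)
```

Here `cfg_c` exceeds the budget, so the most aggressive passing candidate, `cfg_b`, is selected. If no candidate passes, the function returns `None`.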

Quick Start

Note

This workflow requires the vLLM ROCm container. See the example README for full environment setup instructions.

cd /workspace/quark/examples/torch/experimental/mix_precision

# Search all modes for MI300
python mix_precision.py \
    --model_dir /models/Qwen3.5-397B-A17B \
    --hardware mi300 \
    -tp 8 \
    --gpu-memory-utilization 0.8 \
    --export_best_model
