Mix Precision Auto-Search
The Mix Precision Auto-Search workflow automatically finds the optimal mixed-precision quantization configuration for large language models running on AMD Instinct GPUs via vLLM.
Given an accuracy loss budget (for example, GSM8K accuracy must not drop by more than 2%), it evaluates candidate configurations from least to most aggressive quantization and selects the most aggressive one whose accuracy stays within the budget, with no manual parameter testing.
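The budget check itself is simple arithmetic. The sketch below is illustrative only and is not taken from `mix_precision.py`: the function name `within_budget` and the `relative` switch are hypothetical. The text above does not say whether the 2% is an absolute drop in accuracy points or a drop relative to the baseline, so the sketch shows both readings.

```python
def within_budget(baseline: float, candidate: float, max_drop: float,
                  relative: bool = False) -> bool:
    """Return True if the candidate's accuracy loss fits the budget.

    Hypothetical helper for illustration. Accuracies are fractions in [0, 1].
    With relative=False the drop is measured in absolute accuracy points;
    with relative=True it is measured as a fraction of the baseline.
    """
    drop = baseline - candidate
    if relative:
        if baseline <= 0:
            raise ValueError("baseline must be positive for a relative budget")
        drop = drop / baseline
    return drop <= max_drop


# Invented numbers: baseline 50.0%, quantized model 48.5%, budget 2%.
absolute_ok = within_budget(0.50, 0.485, max_drop=0.02)                 # drop = 1.5 points
relative_ok = within_budget(0.50, 0.485, max_drop=0.02, relative=True)  # drop = 3% of baseline
print(absolute_ok, relative_ok)
```

The same candidate passes under one reading and fails under the other, so it is worth confirming which definition the script uses before choosing a threshold.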
Supported Quantization Modes
| Hardware | Supported Modes |
|---|---|
| MI300 | |
| MI325 | |
| MI355 | |
Workflow Overview
1. Load model on meta device: builds the layer graph for quantization config generation with no GPU memory cost.
2. Generate candidate configs: sorted from least to most aggressive quantization.
3. Evaluate baseline: runs GSM8K on the original (unquantized) vLLM model.
4. Search loop: for each config in rank order, re-quantize the model, evaluate it, check the result against the accuracy threshold, and reset.
5. Select best config: the most aggressively quantized config that passes the threshold.
6. Export (optional): applies the best config and saves the model as safetensors.
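The search and selection steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code in `mix_precision.py`: the `Candidate` class, the `rank` field, `search_best_config`, and the `evaluate` callback are all hypothetical, and the accuracies are invented. The sketch evaluates every candidate and treats the budget as an absolute drop; whether the real script stops early or measures the drop differently is not stated here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """Hypothetical stand-in for one mixed-precision config."""
    name: str
    rank: int  # assumed convention: higher rank = more aggressive quantization


def search_best_config(
    candidates: Sequence[Candidate],
    evaluate: Callable[[Optional[Candidate]], float],
    max_drop: float,
) -> Optional[Candidate]:
    """Return the most aggressive candidate whose accuracy drop is within max_drop."""
    baseline = evaluate(None)  # None means "unquantized model" in this sketch
    best = None
    for cand in sorted(candidates, key=lambda c: c.rank):
        # evaluate() stands in for: re-quantize, run the benchmark, reset the model
        accuracy = evaluate(cand)
        if baseline - accuracy <= max_drop:
            best = cand  # a later, more aggressive pass replaces an earlier one
    return best


# Toy accuracies, invented for illustration only.
fake_scores = {None: 0.90, "cfg_a": 0.895, "cfg_b": 0.885, "cfg_c": 0.84}


def fake_evaluate(cand: Optional[Candidate]) -> float:
    return fake_scores[cand.name if cand else None]


candidates = [Candidate("cfg_c", 3), Candidate("cfg_a", 1), Candidate("cfg_b", 2)]
best = search_best_config(candidates, fake_evaluate, max_drop=0.02)
print(best.name)  # cfg_b: cfg_c loses 6 points and is rejected
```

Because candidates are visited in rank order, the last one that passes is also the most aggressive one that passes, so no separate selection step is needed in this sketch.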
Quick Start
Note
This workflow requires the vLLM ROCm container. See the example README for full environment setup instructions.
```bash
cd /workspace/quark/examples/torch/experimental/mix_precision

# Search all modes for MI300
python mix_precision.py \
    --model_dir /models/Qwen3.5-397B-A17B \
    --hardware mi300 \
    -tp 8 \
    --gpu-memory-utilization 0.8 \
    --export_best_model
```
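When the search is launched from another script, it can help to build the argument list programmatically. The helper below is a hypothetical convenience wrapper, not part of the example: `build_search_command` and its parameter names are made up for illustration, and it uses only the flags that appear in the command above.

```python
import shlex
from typing import List


def build_search_command(
    model_dir: str,
    hardware: str,
    tensor_parallel: int,
    gpu_memory_utilization: float = 0.8,
    export_best_model: bool = False,
) -> List[str]:
    """Assemble the mix_precision.py argument list (hypothetical helper)."""
    cmd = [
        "python", "mix_precision.py",
        "--model_dir", model_dir,
        "--hardware", hardware,
        "-tp", str(tensor_parallel),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]
    if export_best_model:
        cmd.append("--export_best_model")
    return cmd


cmd = build_search_command(
    "/models/Qwen3.5-397B-A17B", "mi300", 8, export_best_model=True
)
print(shlex.join(cmd))  # shell-quoted form, convenient for logging
```

The returned list can be passed directly to `subprocess.run(cmd, cwd=...)` with the working directory set to the example folder, which avoids shell quoting problems with unusual model paths.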