Using Quark-Quantized Diffusion Models with HuggingFace Diffusers
Quark integrates with HuggingFace Diffusers so that diffusion models
quantized and exported with Quark can be reloaded through the standard
from_pretrained API, with no per-layer setup code on the user’s
side.
This page covers the reload path that ships in the Quark wheel
today: take a Quark-quantized checkpoint (produced with
quark.torch.export_safetensors) and load it back into a diffusers
pipeline. For producing the checkpoint in the first place, see
Quantizing Diffusion Models with Quark.
Note
The integration mirrors the Quark integration in HuggingFace
Transformers:
the quantizer is registered with diffusers’ AUTO_QUANTIZER_MAPPING
under the quark method, and dispatch is automatic once the
plugin is imported.
Prerequisites
pip install amd-quark diffusers transformers accelerate
Importing quark.integrations.diffusers registers
QuarkDiffusersQuantizer and QuarkQuantizationConfig with
diffusers’ auto-quantizer mappings. This import is required before
loading a Quark-quantized model:
import quark.integrations.diffusers # noqa: F401 -- registers the quark quantizer
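Before running the stages below, it can help to confirm that the prerequisite packages resolve in the active environment. The following is a minimal sketch, not part of Quark; the helper name missing_modules is hypothetical, and the import names are assumed to match the pip packages listed above (amd-quark being imported as quark, as the import line above suggests).

```python
import importlib.util

# Assumption: these top-level import names correspond to the pip packages
# listed in the install command (amd-quark is imported as `quark`).
REQUIRED_MODULES = ("quark", "diffusers", "transformers", "accelerate")


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All prerequisite modules are importable.")
```

This only checks that the packages can be found; it does not import them or register the quantizer, so the explicit plugin import is still needed.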
End-to-end: quantize, export, reload
The full lifecycle has three stages. Stages 1 and 2 are the offline workflow from Quantizing Diffusion Models with Quark; stage 3 is the reload covered here.
Stage 1 – quantize a submodule
import torch
from diffusers import DiffusionPipeline
from quark.torch import ModelQuantizer
from quark.torch.quantization.config.config import Int8PerTensorSpec, QConfig, QLayerConfig
from quark.torch.utils.diffusers import get_calib_dataloader
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16, variant="fp16",
)
pipe.to("cuda")
prompts = [
    "A serene lake reflecting mountains at sunset",
    "A futuristic city with flying cars at night",
]
dataloader = get_calib_dataloader(pipe, pipe.unet, prompts, n_steps=20, guidance_scale=8.0)
weight_spec = Int8PerTensorSpec(
    observer_method="min_max", symmetric=True, scale_type="float",
    round_method="half_even", is_dynamic=False,
).to_quantization_spec()
qconfig = QConfig(global_quant_config=QLayerConfig(weight=weight_spec))
pipe.unet = ModelQuantizer(qconfig).quantize_model(pipe.unet, dataloader)
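To make the weight spec above more concrete, here is a plain-Python sketch of what symmetric, per-tensor INT8 quantization with a min/max-derived float scale and half-to-even rounding amounts to arithmetically. It is an illustration under stated assumptions, not Quark's implementation: the function names are hypothetical, and the choice of mapping the largest magnitude onto 127 is an assumption about the convention.

```python
def quantize_int8_symmetric(weights):
    """Symmetric per-tensor INT8 quantization with a min/max-derived scale."""
    max_abs = max(abs(w) for w in weights)
    # Symmetric: the zero-point is 0, so a single float scale maps
    # [-max_abs, max_abs] onto [-127, 127] (assumed convention).
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Python's built-in round() rounds exact halves to the nearest even integer.
    quantized = [max(-128, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate float values."""
    return [q * scale for q in quantized]


weights = [2.5, 3.5, -127.0, 0.4]
q, scale = quantize_int8_symmetric(weights)
print(q, scale)  # [2, 4, -127, 0] 1.0
```

The values 2.5 and 3.5 show the half-to-even behavior: they round to 2 and 4 respectively rather than both rounding up.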
Stage 2 – export the quantized submodule
export_safetensors detects a diffusers ModelMixin and writes a
diffusion_pytorch_model.safetensors plus a config.json that
carries a quantization_config block describing the Quark
QConfig.
from quark.torch import export_safetensors
export_safetensors(pipe.unet, "sdxl-quark-int8/unet")
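A quick way to confirm that an export produced the expected metadata is to read config.json back and look for the quantization_config block. The sketch below uses only the standard library; read_quant_method is a hypothetical helper, and the demo runs against a stand-in directory whose config.json contains only the quant_method field described on this page, not a real Quark export.

```python
import json
import tempfile
from pathlib import Path


def read_quant_method(model_dir):
    """Return the quant_method recorded in <model_dir>/config.json, or None."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    quant_cfg = config.get("quantization_config")
    if not isinstance(quant_cfg, dict):
        return None  # no quantization block: a plain, unquantized checkpoint
    return quant_cfg.get("quant_method")


# Demo on a stand-in directory (assumption: a real export carries more fields).
with tempfile.TemporaryDirectory() as tmp:
    stand_in = {"quantization_config": {"quant_method": "quark"}}
    (Path(tmp) / "config.json").write_text(json.dumps(stand_in), encoding="utf-8")
    detected = read_quant_method(tmp)

print(detected)  # quark
```

Pointed at the export directory from this stage (sdxl-quark-int8/unet), the helper would be expected to return "quark" if the export behaved as described.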
Stage 3 – reload through from_pretrained
Because config.json carries quant_method = "quark", the
diffusers loader dispatches to QuarkDiffusersQuantizer automatically.
It replaces the relevant layers and loads the quantized state dict, then
freezes the model so it is inference-ready.
import torch
import quark.integrations.diffusers # noqa: F401 -- registers the quark quantizer
from diffusers import UNet2DConditionModel, DiffusionPipeline
unet = UNet2DConditionModel.from_pretrained("sdxl-quark-int8/unet", torch_dtype=torch.float16)
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    unet=unet,
    torch_dtype=torch.float16, variant="fp16",
).to("cuda")
image = pipe("A cat on a windowsill", num_inference_steps=30, guidance_scale=8.0).images[0]
image.save("sdxl_int8_reloaded.png")
If a model on the Hub already carries a quantization_config block in
its config.json (for example, a pre-quantized pipeline published by
AMD), no extra setup beyond the plugin import is required – the loader
sees quant_method = "quark" and instantiates
QuarkDiffusersQuantizer for you.
How dispatch works
quark/integrations/diffusers/__init__.py registers two entries when
imported:
from diffusers.quantizers.auto import (
    AUTO_QUANTIZATION_CONFIG_MAPPING,
    AUTO_QUANTIZER_MAPPING,
)
AUTO_QUANTIZER_MAPPING["quark"] = QuarkDiffusersQuantizer
AUTO_QUANTIZATION_CONFIG_MAPPING["quark"] = QuarkQuantizationConfig
At load time, from_pretrained reads the quant_method field from
the checkpoint’s quantization_config, looks it up in these mappings,
and drives the load through QuarkDiffusersQuantizer:
- _process_model_before_weight_loading rebuilds the QConfig from the serialized dict and applies Quark’s module transformation so the state dict loads into the correct quantized modules.
- _process_model_after_weight_loading calls ModelQuantizer.freeze(model, quantize=False) so the model is inference-ready and compatible with torch.compile.
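The registry-and-hooks pattern described here can be illustrated with a toy model that has no dependency on diffusers or Quark. Everything in this sketch (ToyQuantizer, TOY_QUANTIZER_MAPPING, toy_load, and the dict standing in for a model) is hypothetical and only mimics the control flow: look up quant_method in a mapping, run a hook before the weights load, load them, then run a hook afterwards.

```python
class ToyQuantizer:
    """Stand-in for a quantizer class; records the order in which hooks run."""

    def __init__(self, quantization_config):
        self.quantization_config = quantization_config
        self.calls = []

    def before_weight_loading(self, model):
        # Real counterpart: swap float layers for quantized modules.
        self.calls.append("before")
        model["layers"] = "quantized"

    def after_weight_loading(self, model):
        # Real counterpart: freeze the model so it is inference-ready.
        self.calls.append("after")
        model["frozen"] = True


# Toy equivalent of a method-name -> quantizer-class registry.
TOY_QUANTIZER_MAPPING = {"quark": ToyQuantizer}


def toy_load(config, state_dict):
    """Mimic a loader that dispatches on config['quantization_config']."""
    model = {"layers": "float", "frozen": False, "weights": None}
    quant_cfg = config.get("quantization_config")
    if quant_cfg is None:
        model["weights"] = state_dict  # unquantized path: no hooks involved
        return model, None
    method = quant_cfg["quant_method"]
    if method not in TOY_QUANTIZER_MAPPING:
        raise ValueError(f"no quantizer registered for {method!r}")
    quantizer = TOY_QUANTIZER_MAPPING[method](quant_cfg)
    quantizer.before_weight_loading(model)
    model["weights"] = state_dict
    quantizer.after_weight_loading(model)
    return model, quantizer


model, quantizer = toy_load(
    {"quantization_config": {"quant_method": "quark"}}, {"w": [1, 2, 3]}
)
print(quantizer.calls)  # ['before', 'after']
```

In this toy, a method name missing from the mapping fails the lookup, which is the situation the plugin import is meant to prevent in the real loader.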
What is supported today
| Capability | Status |
|---|---|
| Reload a Quark-quantized checkpoint via | Supported |
| Weight-only configurations (INT8, INT4, MXFP4, …) | Supported and covered by the integration test suite |
| Activation-quantized configurations (w8a8, SVDQuant, FP8 with calibrated activations) | The activation scales are captured during offline calibration and written into the checkpoint; reload support depends on the configuration. Validate the specific config end-to-end before relying on it in production. |
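One possible way to validate a configuration end-to-end is to generate the same prompt with the same seed from the reference pipeline and the reloaded one, then compare the images numerically. The sketch below computes PSNR over two flat sequences of pixel values. It is an assumption about how such a check might look, not a procedure prescribed by Quark; psnr is a hypothetical helper, and any acceptance threshold is left to the reader.

```python
import math


def psnr(reference, candidate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(candidate) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# Identical inputs give infinite PSNR; larger differences give lower values.
print(psnr([10, 20, 30], [10, 20, 30]))  # inf
print(psnr([0, 0], [255, 255]))          # 0.0
```

Higher values indicate the reloaded model's output is closer to the reference. A visual check of the generated images remains worthwhile alongside any single metric.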