ONNX Exporting#
PyTorch provides a native function, torch.onnx.export, for exporting a model to the ONNX graph format.
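For reference, below is a minimal, hedged sketch of that native exporter; the toy module, input shape, and output file name are illustrative only and not part of the Quark workflow.

```python
import torch

# Minimal sketch of the native PyTorch exporter (not Quark-specific).
# The toy module, input shape, and file name are placeholders.
model = torch.nn.Linear(16, 8).eval()
example_input = torch.randn(1, 16)
torch.onnx.export(model, (example_input,), "model.onnx")
```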
Quark supports exporting the ONNX graph for int4, int8, fp8, float16, and bfloat16 quantized models.
For int4, int8, and fp8 quantization, the quantization operators used in the ONNX graph are QuantizeLinear/DequantizeLinear pairs. For float16 and bfloat16 quantization, the quantization operators are Cast/Cast pairs.
Mixed quantization of int4/uint4 and int8/uint8 is not currently supported. In other words, if a model contains both uint4/int4 and uint8/int8 quantized nodes, this function cannot be used to export its ONNX graph.
Only weight-only and static quantization are supported for now.
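If you want to confirm which pattern an exported graph actually contains, a quick sketch with the onnx Python package can count the relevant node types. The file path below is a placeholder, not a name Quark guarantees.

```python
from collections import Counter

import onnx

# Count operator types in an exported graph to see which pattern was emitted:
# QuantizeLinear/DequantizeLinear pairs for int4/int8/fp8 quantization,
# Cast/Cast pairs for float16/bfloat16. The file name is a placeholder.
onnx_model = onnx.load("./output_dir/exported_model.onnx")
op_counts = Counter(node.op_type for node in onnx_model.graph.node)
for op in ("QuantizeLinear", "DequantizeLinear", "Cast"):
    print(op, op_counts.get(op, 0))
```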
Example of ONNX Exporting#
```python
from quark.torch import export_onnx

export_path = "./output_dir"

# Take one batch from the calibration dataloader as example inputs for tracing.
batch_iter = iter(calib_dataloader)
input_args = next(batch_iter)

# Set the flag when the model was quantized with a uint4/int4 scheme.
if args.quant_scheme in ["w_int4_per_channel_sym", "w_uint4_per_group_asym",
                         "w_int4_per_group_sym", "w_uint4_a_bfloat16_per_group_asym"]:
    uint4_int4_flag = True
else:
    uint4_int4_flag = False

export_onnx(
    model=model,
    output_dir=export_path,
    input_args=input_args,
    uint4_int4_flag=uint4_int4_flag,
)
```
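Once export_onnx returns, a quick sanity check can confirm that the exported graph is well formed and loadable. The sketch below assumes a .onnx file was written under export_path; the exact file name is a placeholder.

```python
import onnx
import onnxruntime as ort

# Placeholder path: the file name actually produced by export_onnx may differ.
onnx_path = "./output_dir/exported_model.onnx"

# Structural validation of the exported graph.
onnx.checker.check_model(onnx_path)

# Load the graph in ONNX Runtime to confirm it is executable on CPU.
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
print([inp.name for inp in session.get_inputs()])
```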