Exporting#
Onnx Exporting#
PyTorch provides a function, torch.onnx.export, to export the ONNX graph. Quark supports exporting the ONNX graph for int4, int8, fp8, float16, and bfloat16 quantized models. For int4, int8, and fp8 quantization, the quantization operators used in the ONNX graph are the QuantizeLinear/DequantizeLinear pair. For float16 and bfloat16 quantization, the quantization operators are a Cast/Cast pair. Mixed quantization of int4/uint4 and int8/uint8 is not currently supported; in other words, if the model contains both uint4/int4 and uint8/int8 quantized nodes, this function cannot be used to export the ONNX graph.
Example of Onnx Exporting#
from quark.torch import ModelExporter
from quark.torch.export.config.custom_config import DEFAULT_EXPORTER_CONFIG

export_path = "./output_dir"

# Take one batch from the calibration dataloader as example input for export.
batch_iter = iter(calib_dataloader)
input_args = next(batch_iter)

# Flag whether the model was quantized with a uint4/int4 scheme.
if args.quant_scheme in ["w_int4_per_channel_sym", "w_uint4_per_group_asym", "w_int4_per_group_sym", "w_uint4_a_bfloat16_per_group_asym"]:
    uint4_int4_flag = True
else:
    uint4_int4_flag = False

exporter = ModelExporter(config=DEFAULT_EXPORTER_CONFIG, export_dir=export_path)
exporter.export_onnx_model(model, input_args, uint4_int4_flag=uint4_int4_flag)
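After the export finishes, a quick sanity check is to load the resulting graph with the onnx package and confirm that the expected quantization operators are present. The snippet below is only an illustrative sketch, assuming the exporter writes a .onnx file into export_path; the file name is discovered by globbing rather than assumed.

import glob
import os

import onnx

# Assumption: a .onnx file is written into export_path by the exporter.
onnx_files = glob.glob(os.path.join(export_path, "*.onnx"))
assert onnx_files, f"no .onnx file found in {export_path}"

onnx_model = onnx.load(onnx_files[0])
onnx.checker.check_model(onnx_model)

# int4/int8/fp8 models should contain QuantizeLinear/DequantizeLinear nodes,
# while float16/bfloat16 models should contain Cast nodes.
op_types = {node.op_type for node in onnx_model.graph.node}
print(sorted(op_types & {"QuantizeLinear", "DequantizeLinear", "Cast"}))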
Json-Safetensors (vLLM Adopted) Exporting#
Json-safetensors exporting is specifically designed for the vLLM compiler. Currently, the supported quantization formats include only fp8 per-tensor quantization, weight-only int4 per-group quantization, and w4a8 per-group quantization. The supported models include Llama2-7b, Llama2-13b, Llama2-70b, and Llama3-8b.
Example of Json-Safetensors (vLLM Adopted) Exporting#
from quark.torch import ModelExporter
from quark.torch.export.config.custom_config import DEFAULT_EXPORTER_CONFIG

export_path = "./output_dir"

exporter = ModelExporter(config=DEFAULT_EXPORTER_CONFIG, export_dir=export_path)
# Export the quantized model's configuration (JSON) and weights (safetensors) in the vLLM-adopted format.
exporter.export_model_info(model, model_type, model_dtype, export_type="vllm-adopt")
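The export directory then contains the model's configuration as JSON and its weights as safetensors. The snippet below is a minimal sketch for inspecting that output, assuming one JSON file and one .safetensors file are written into export_path; the file names are discovered by globbing rather than assumed.

import glob
import json
import os

from safetensors import safe_open

# Assumption: the exporter writes one JSON config and one .safetensors weight file.
config_file = glob.glob(os.path.join(export_path, "*.json"))[0]
weight_file = glob.glob(os.path.join(export_path, "*.safetensors"))[0]

with open(config_file) as f:
    config = json.load(f)
print(list(config.keys()))

# List a few stored tensors and their dtypes without loading the whole checkpoint.
with safe_open(weight_file, framework="pt") as f:
    for name in list(f.keys())[:5]:
        print(name, f.get_tensor(name).dtype)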