PyTorch model export and reloading
Quark Exporting and Importing API for PyTorch.
- quark.torch.export.api.export_safetensors(model: Module, output_dir: str | Path, custom_mode: str = 'quark', weight_format: str = 'real_quantized', pack_method: str = 'reorder') → None
Export the quantized PyTorch model to Safetensors format.
The model’s network architecture or configuration is stored in the json file, and parameters including weight, bias, scale, and zero_point are stored in the safetensors file.
- Parameters:
model (torch.nn.Module) – The quantized model to be exported.
output_dir (Union[str, Path]) – Directory to save the exported files.
custom_mode (str) – Export mode determining quantization handling. Defaults to "quark". Possible values are:
    "quark": standard Quark format. This is the default and recommended format.
    "awq": targets the AutoAWQ library.
    "fp8": targets vLLM-compatible fp8 models.
weight_format (str) – How to handle quantized parameters. Defaults to "real_quantized". Possible values are:
    "real_quantized": actual quantized parameters.
    "fake_quantized": QDQ (Quantize-Dequantize) representation of quantized parameters.
pack_method (str) – Packing strategy for real_quantized parameters. Defaults to "reorder". Possible values are:
    "reorder": reorder the real_quantized parameter layout for hardware.
    "order": keep the original real_quantized parameter layout.
- Returns:
None
Example:
    from quark.torch import export_safetensors
    export_path = "./output_dir"
    export_safetensors(model, export_path, custom_mode="quark", weight_format="real_quantized", pack_method="reorder")
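The difference between the "real_quantized" and "fake_quantized" weight formats can be illustrated with a minimal, library-independent sketch. It assumes a plain affine int8 scheme; the helper names, the scale, and the zero point are illustrative and are not part of the Quark API, and the way Quark actually stores these values is not described here.

```python
def quantize(x: float, scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> int:
    """Map a float to an integer code and clamp it to the target range."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Map an integer code back to a float."""
    return (q - zero_point) * scale


def fake_quantize(x: float, scale: float, zero_point: int) -> float:
    """QDQ: quantize then dequantize, so the result stays a float."""
    return dequantize(quantize(x, scale, zero_point), scale, zero_point)


weights = [0.26, -0.51, 1.0]
scale, zero_point = 0.25, 0  # illustrative values, not taken from a real model

# "Real" quantized weights are the integer codes themselves.
real_quantized = [quantize(w, scale, zero_point) for w in weights]
# "Fake" quantized weights are floats that carry the rounding error of the codes.
fake_quantized = [fake_quantize(w, scale, zero_point) for w in weights]

print(real_quantized)
print(fake_quantized)
```

In this sketch the integer codes are what a real-quantized export would need to store alongside scale and zero_point, while the QDQ values remain ordinary floats.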
- quark.torch.export.api.export_onnx(model: Module, output_dir: str | Path, input_args: tuple[Any, ...], opset_version: int | None = None, input_names: list[str] = [], output_names: list[str] = [], verbose: bool = False, do_constant_folding: bool = True, operator_export_type: OperatorExportTypes = OperatorExportTypes.ONNX, uint4_int4_flag: bool = False, dynamo: bool = False) → None
Export the onnx graph of the quantized PyTorch model.
- Parameters:
model (torch.nn.Module) – The quantized model to be exported.
output_dir (Union[str, Path]) – Directory to save the ONNX file
input_args (Union[torch.Tensor, Tuple[float]]) – Example inputs for ONNX tracing.
opset_version (Optional[int]) – The version of the ONNX opset to target. If not set, the latest opset version that is stable for the current version of PyTorch is used. Defaults to None.
input_names (List[str]) – Names to assign to the input nodes of the ONNX graph, in order. Defaults to [].
output_names (List[str]) – Names to assign to the output nodes of the ONNX graph, in order. Defaults to [].
verbose (bool) – Whether to show verbose logs. Defaults to False.
do_constant_folding (bool) – Whether to apply the constant folding optimization. Defaults to True.
operator_export_type (torch.onnx.OperatorExportTypes) – Operator export type in the ONNX graph. The choices are OperatorExportTypes.ONNX, OperatorExportTypes.ONNX_FALLTHROUGH, OperatorExportTypes.ONNX_ATEN and OperatorExportTypes.ONNX_ATEN_FALLBACK. Defaults to OperatorExportTypes.ONNX.
uint4_int4_flag (bool) – Whether the model is uint4/int4 quantized. Defaults to False.
dynamo (bool) – Whether to export the model with torch.export ExportedProgram instead of TorchScript. Please refer to the PyTorch documentation for more details. Defaults to False.
- Returns:
None
Example:
    from quark.torch import export_onnx
    export_onnx(model, output_dir, input_args)
- Note:
Mixed quantization of int4/uint4 and int8/uint8 is not currently supported. In other words, if the model contains both uint4/int4 and uint8/int8 quantized nodes, this function cannot be used to export the ONNX graph.
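The restriction in the note can be checked before calling export_onnx. The following is a minimal sketch that assumes the caller has already collected one dtype string per quantized layer; the helper name and the dtype strings are hypothetical, and how to read them from a Quark model is not covered here.

```python
FOUR_BIT = {"int4", "uint4"}
EIGHT_BIT = {"int8", "uint8"}


def pick_uint4_int4_flag(layer_dtypes: list) -> bool:
    """Return a value for uint4_int4_flag, or raise if widths are mixed.

    layer_dtypes is a hypothetical list of dtype names, one per quantized layer.
    """
    has_4bit = any(d in FOUR_BIT for d in layer_dtypes)
    has_8bit = any(d in EIGHT_BIT for d in layer_dtypes)
    if has_4bit and has_8bit:
        # Mirrors the documented limitation: mixed 4-bit/8-bit models
        # cannot be exported with export_onnx.
        raise ValueError("mixed int4/uint4 and int8/uint8 quantization is not supported")
    return has_4bit


print(pick_uint4_int4_flag(["uint4", "int4"]))
print(pick_uint4_int4_flag(["int8", "uint8"]))
```

The returned value could then be passed as uint4_int4_flag, assuming the flag is meant to be True exactly when the model is 4-bit quantized.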
- quark.torch.export.api.export_gguf(model: Module, output_dir: str | Path, model_type: str, tokenizer_path: str | Path) → None
Export the gguf file of the quantized PyTorch model.
- Parameters:
model (torch.nn.Module) – The quantized model to be exported.
output_dir (Union[str, Path]) – Directory to save the GGUF file.
model_type (str) – The model type of the model, e.g. "gpt2", "gptj", or "llama".
tokenizer_path (Union[str, Path]) – The tokenizer needs to be encoded into the GGUF model. This argument specifies the directory path of the tokenizer, which contains tokenizer.json, tokenizer_config.json and/or tokenizer.model.
- Returns:
None
Example:
    from quark.torch import export_gguf
    export_gguf(model, output_dir, model_type, tokenizer_path)
- Note:
Currently, only asymmetric int4 per_group weight-only quantization is supported, and the group_size must be 32. Supported models include Llama2-7b, Llama2-13b, Llama2-70b, and Llama3-8b.
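To make the constraint in the note concrete, here is a minimal pure-Python sketch of asymmetric int4 per-group quantization with a group size of 32. It assumes a signed code range of -8 to 7 and a min/max calibration rule; both are illustrative choices, and neither Quark's actual algorithm nor the GGUF storage layout is shown.

```python
GROUP_SIZE = 32  # the only group size the note allows


def quantize_group(values, qmin=-8, qmax=7):
    """Quantize one group with its own scale and zero point (asymmetric)."""
    lo, hi = min(values), max(values)
    # Fall back to 1.0 for a constant group to avoid dividing by zero.
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = qmin - round(lo / scale)
    codes = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def quantize_per_group(row, group_size=GROUP_SIZE):
    """Split a weight row into fixed-size groups and quantize each one."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of the group size")
    return [quantize_group(row[i:i + group_size])
            for i in range(0, len(row), group_size)]


row = [i / 10 for i in range(64)]  # illustrative weights: two groups of 32
groups = quantize_per_group(row)
for codes, scale, zero_point in groups:
    print(min(codes), max(codes), zero_point)
```

Because each group carries its own scale and zero point, a group whose values sit far from zero still uses the full int4 code range.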
- quark.torch.export.api.import_model_from_safetensors(model: Module, model_dir: str, multi_device: bool = False) → Module
Imports a quantized model from the local directory model_dir into a non-quantized model model.
- Parameters:
model (torch.nn.Module) – The non-quantized model, which is transformed in place into a quantized model using the "quantization_config" in the config.json file found in the local directory model_dir, and into which the quantized weights are loaded.
model_dir (str) – Directory containing the model files (config.json and model.safetensors).
multi_device (bool) – Whether to use multi-device loading using the Accelerate library. Defaults to False.
- Returns:
The model with loaded weights and proper quantization modules.
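Since the import relies on config.json (with a "quantization_config" entry) and model.safetensors being present in model_dir, a small pre-flight check can surface a misconfigured directory early. This is a sketch using only the standard library; the helper name is hypothetical, and it assumes the single-file layout named in the parameter description.

```python
import json
import tempfile
from pathlib import Path


def check_export_dir(model_dir):
    """Return a list of problems found in model_dir (empty if none)."""
    model_dir = Path(model_dir)
    problems = []
    config_path = model_dir / "config.json"
    if not config_path.is_file():
        problems.append("missing config.json")
    else:
        config = json.loads(config_path.read_text())
        if "quantization_config" not in config:
            problems.append('config.json has no "quantization_config" entry')
    if not (model_dir / "model.safetensors").is_file():
        problems.append("missing model.safetensors")
    return problems


# Demo on a throwaway directory; the file contents are placeholders only.
with tempfile.TemporaryDirectory() as tmp:
    empty_report = check_export_dir(tmp)
    Path(tmp, "config.json").write_text(json.dumps({"quantization_config": {}}))
    Path(tmp, "model.safetensors").write_bytes(b"")
    complete_report = check_export_dir(tmp)

print(empty_report)
print(complete_report)
```

An empty report only means the expected files exist; it does not validate the contents of the safetensors file.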
- quark.torch.export.api.save_params(model: Module, model_type: str, args: tuple[Any, ...] | None = None, kwargs: dict[str, Any] | None = None, export_dir: Path | str = '/tmp', quant_mode: QuantizationMode = QuantizationMode.eager_mode, compressed: bool = False, reorder: bool = True) → None
Save the network architecture or configuration and the parameters of the quantized model. For eager mode quantization, the model's configuration is stored in a JSON file, and parameters including weight, bias, scale, and zero_point are stored in a safetensors file. For fx_graph mode quantization, the model's network architecture and parameters are stored in a pth file.
- Parameters:
model (torch.nn.Module) – The quantized model to be saved.
model_type (str) – The type of the model, e.g. gpt2, gptj, llama or gptnext.
args (Optional[Tuple[Any, ...]]) – Example tuple inputs for this quantized model. Only available for fx_graph mode quantization. Default is None.
kwargs (Optional[Dict[str, Any]]) – Example dict inputs for this quantized model. Only available for fx_graph mode quantization. Default is None.
export_dir (Union[Path, str]) – The target export directory.
quant_mode (QuantizationMode) – The quantization mode. The choices are QuantizationMode.eager_mode and QuantizationMode.fx_graph_mode. Default is QuantizationMode.eager_mode.
compressed (bool) – Whether to export the compressed (real quantized) model instead of the QDQ model. Default is False, which exports the QDQ model.
reorder (bool) – Pack method; packs the weights (e.g. packs four torch.int8 values into one torch.int32 value). Default is True.
- Returns:
None
Examples:
    # eager mode:
    from quark.torch import save_params
    save_params(model, model_type=model_type, export_dir="./save_dir")

    # fx_graph mode:
    # QuantizationMode must also be imported; its import path is not shown in this reference.
    from quark.torch.export.api import save_params
    save_params(model, model_type=model_type, args=example_inputs, export_dir="./save_dir", quant_mode=QuantizationMode.fx_graph_mode)
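The packing mentioned for the reorder parameter (four int8 values into one int32 value) can be sketched with the standard struct module. The little-endian byte order and the helper names are assumptions made for illustration; the layout Quark actually produces is not specified in this reference.

```python
import struct


def pack_int8x4(values):
    """Pack four signed 8-bit ints into one signed 32-bit int (little-endian)."""
    if len(values) != 4:
        raise ValueError("expected exactly four int8 values")
    raw = struct.pack("<4b", *values)  # four signed bytes
    return struct.unpack("<i", raw)[0]  # reinterpret as one int32


def unpack_int8x4(packed):
    """Inverse of pack_int8x4: recover the four signed 8-bit ints."""
    return list(struct.unpack("<4b", struct.pack("<i", packed)))


packed = pack_int8x4([1, 2, 3, 4])
print(hex(packed))
print(unpack_int8x4(packed))
```

Packing this way stores the same bits in a quarter as many elements, which is why a packed export needs a matching unpack step on load.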