PyTorch model export and reloading#

Quark Exporting and Importing API for PyTorch.

quark.torch.export.api.export_safetensors(model: Module, output_dir: str | Path, custom_mode: str = 'quark', weight_format: str = 'real_quantized', pack_method: str = 'reorder') None[source]#

Export the quantized PyTorch model to Safetensors format.

The model’s network architecture or configuration is stored in a JSON file, and parameters including weight, bias, scale, and zero_point are stored in a safetensors file.

Parameters:
  • model (torch.nn.Module) – The quantized model to be exported.

  • output_dir (Union[str, Path]) – Directory to save the exported files.

  • custom_mode (str) –

    Export mode determining quantization handling. Defaults to "quark". Possible values are:

    • "quark": standard quark format. This is the default and recommended format that should be favored.

    • "awq": targets AutoAWQ library.

    • "fp8": targets vLLM-compatible fp8 models.

  • weight_format (str) –

    How to handle quantized parameters. Defaults to "real_quantized". Possible values are:

    • "real_quantized": actual quantized parameters.

    • "fake_quantized": QDQ (Quantize-Dequantize) representation of quantized parameters.

  • pack_method (str) –

    Packing strategy for real_quantized parameters. Defaults to "reorder". Possible values are:

    • "reorder": reorder the layout of the real_quantized parameters for the target hardware.

    • "order": keep the original layout of the real_quantized parameters.

Returns:

None

Example:

from quark.torch import export_safetensors

export_path = "./output_dir"
export_safetensors(model, export_path, custom_mode="quark", weight_format="real_quantized", pack_method="reorder")
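
The export directory then contains the JSON configuration file and the safetensors parameter file described above. As a follow-up sketch (the file names config.json and model.safetensors are assumed here, matching what import_model_from_safetensors expects below, and may differ depending on the export mode), the exported quantization settings and tensors can be inspected directly:

import json
from pathlib import Path

from safetensors import safe_open

export_path = Path("./output_dir")

# The quantization settings are stored alongside the model configuration.
with open(export_path / "config.json") as f:
    quantization_config = json.load(f).get("quantization_config", {})

# Real_quantized parameters (weight, bias, scale, zero_point, ...) live in the safetensors file.
with safe_open(str(export_path / "model.safetensors"), framework="pt") as f:
    for name in f.keys():
        print(name, f.get_tensor(name).dtype, tuple(f.get_tensor(name).shape))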

quark.torch.export.api.export_onnx(model: Module, output_dir: str | Path, input_args: tuple[Any, ...], opset_version: int | None = None, input_names: list[str] = [], output_names: list[str] = [], verbose: bool = False, do_constant_folding: bool = True, operator_export_type: OperatorExportTypes = OperatorExportTypes.ONNX, uint4_int4_flag: bool = False, dynamo: bool = False) None[source]#

Export the ONNX graph of the quantized PyTorch model.

Parameters:
  • model (torch.nn.Module) – The quantized model to be exported.

  • output_dir (Union[str, Path]) – Directory to save the ONNX file.

  • input_args (Tuple[Any, ...]) – Example inputs for ONNX tracing.

  • opset_version (Optional[int]) – The version of the ONNX opset to target. If not set, the latest opset version that is stable for the current version of PyTorch is used. Defaults to None.

  • input_names (List[str]) – Names to assign to the input nodes of the onnx graph, in order. Defaults to [].

  • output_names (List[str]) – Names to assign to the output nodes of the onnx graph, in order. Defaults to [].

  • verbose (bool) – Flag controlling whether to print verbose logs. Defaults to False.

  • do_constant_folding (bool) – Flag to apply constant folding optimization. Defaults to True.

  • operator_export_type (torch.onnx.OperatorExportTypes) – The operator export type to use in the ONNX graph. The choices are OperatorExportTypes.ONNX, OperatorExportTypes.ONNX_FALLTHROUGH, OperatorExportTypes.ONNX_ATEN, and OperatorExportTypes.ONNX_ATEN_FALLBACK. Defaults to OperatorExportTypes.ONNX.

  • uint4_int4_flag (bool) – Flag indicating whether the model is uint4/int4 quantized. Defaults to False.

  • dynamo (bool) – Whether to export the model with torch.export ExportedProgram instead of TorchScript. Please refer to PyTorch documentation for more details. Defaults to False.

Returns:

None

Example:

from quark.torch import export_onnx

export_onnx(model, output_dir, input_args)

Note:

Mixed quantization of int4/uint4 and int8/uint8 is not currently supported. In other words, if the model contains both uint4/int4 and uint8/int8 quantized nodes, this function cannot be used to export the ONNX graph.
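
A slightly fuller sketch, assuming a causal language model whose forward pass takes a single batch of token ids (the shapes, vocabulary size, and tensor names below are illustrative assumptions, not requirements of the API):

import torch
from quark.torch import export_onnx

# Dummy input used only for tracing; its shape and dtype must match the model's forward signature.
dummy_input_ids = torch.randint(0, 32000, (1, 128), dtype=torch.long)

export_onnx(
    model,
    output_dir="./onnx_export",
    input_args=(dummy_input_ids,),
    input_names=["input_ids"],
    output_names=["logits"],
    do_constant_folding=True,
)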

quark.torch.export.api.export_gguf(model: Module, output_dir: str | Path, model_type: str, tokenizer_path: str | Path) None[source]#

Export the GGUF file of the quantized PyTorch model.

Parameters:
  • model (torch.nn.Module) – The quantized model to be exported.

  • output_dir (Union[str, Path]) – Directory to save the GGUF file.

  • model_type (str) – The model type of the model, e.g. "gpt2", "gptj", or "llama".

  • tokenizer_path (Union[str, Path]) – The tokenizer must be encoded into the GGUF model; this argument specifies the directory path of the tokenizer, which contains tokenizer.json, tokenizer_config.json, and/or tokenizer.model.

Returns:

None

Example:

from quark.torch import export_gguf
export_gguf(model, output_dir, model_type, tokenizer_path)

Note:

Currently, only asymmetric int4 per_group weight-only quantization is supported, and the group_size must be 32. Supported models include Llama2-7b, Llama2-13b, Llama2-70b, and Llama3-8b.
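
For instance, a sketch for a Llama-family checkpoint (the local paths below are placeholders; the tokenizer directory is assumed to contain the tokenizer files listed above):

from quark.torch import export_gguf

# model_type must be one of the supported architecture names, e.g. "llama".
export_gguf(
    model,
    output_dir="./gguf_export",
    model_type="llama",
    tokenizer_path="./llama2-7b-hf",  # directory holding tokenizer.json / tokenizer.model
)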

quark.torch.export.api.import_model_from_safetensors(model: Module, model_dir: str, multi_device: bool = False) Module[source]#

Imports a quantized model from the local directory model_dir into a non-quantized model model.

Parameters:
  • model (torch.nn.Module) – The non-quantized model that will be transformed in place into a quantized model, using the "quantization_config" from the config.json file found in the local directory model_dir, and into which the quantized weights will be loaded.

  • model_dir (str) – Directory containing the model files (config.json and model.safetensors).

  • multi_device (bool) – Whether to use multi-device loading with the Accelerate library. Defaults to False.

Returns:

The model with loaded weights and proper quantization modules.
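
Example:

A minimal sketch, assuming the non-quantized model skeleton is built with the Transformers library and that model_dir was produced by export_safetensors above; the checkpoint name and directories are placeholders:

from transformers import AutoModelForCausalLM
from quark.torch.export.api import import_model_from_safetensors

# Build the non-quantized model skeleton; it is transformed in place and the
# quantized weights from model.safetensors are loaded into it.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = import_model_from_safetensors(model, model_dir="./output_dir")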

quark.torch.export.api.save_params(model: Module, model_type: str, args: tuple[Any, ...] | None = None, kwargs: dict[str, Any] | None = None, export_dir: Path | str = '/tmp', quant_mode: QuantizationMode = QuantizationMode.eager_mode, compressed: bool = False, reorder: bool = True) None[source]#

Save the network architecture or configuration and the parameters of the quantized model. For eager mode quantization, the model’s configuration is stored in a JSON file, and parameters including weight, bias, scale, and zero_point are stored in a safetensors file. For fx_graph mode quantization, the model’s network architecture and parameters are stored in a .pth file.

Parameters:
  • model (torch.nn.Module) – The quantized model to be saved.

  • model_type (str) – The type of the model, e.g. "gpt2", "gptj", "llama", or "gptnext".

  • args (Optional[Tuple[Any, ...]]) – Example tuple inputs for this quantized model. Only available for fx_graph mode quantization. Default is None.

  • kwargs (Optional[Dict[str, Any]]) – Example dict inputs for this quantized model. Only available for fx_graph mode quantization. Default is None.

  • export_dir (Union[Path, str]) – The target export directory.

  • quant_mode (QuantizationMode) – The quantization mode. The choice includes QuantizationMode.eager_mode and QuantizationMode.fx_graph_mode. Default is QuantizationMode.eager_mode.

  • compressed (bool) – Whether to export the compressed (real_quantized) model rather than the QDQ model. Defaults to False, which exports the QDQ model.

  • reorder (bool) – Whether to pack the weights (e.g., pack four torch.int8 values into one torch.int32 value). Defaults to True.

Returns:

None

Examples:

# eager mode:
from quark.torch import save_params
save_params(model, model_type=model_type, export_dir="./save_dir")

# fx_graph mode:
from quark.torch.export.api import save_params
# QuantizationMode must also be imported from Quark's quantization API.
save_params(model,
            model_type=model_type,
            args=example_inputs,
            export_dir="./save_dir",
            quant_mode=QuantizationMode.fx_graph_mode)
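
For fx_graph mode, example_inputs is the tuple of example tensors the quantized model is called with during export. A minimal sketch, assuming a model that takes a single batch of token ids (shapes and vocabulary size are illustrative):

import torch

example_inputs = (torch.randint(0, 32000, (1, 128), dtype=torch.long),)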