PyTorch model export and reloading#

Quark Exporting and Importing API for PyTorch.

class quark.torch.export.api.ModelExporter(config: ExporterConfig, export_dir: Path | str = '/tmp')[source]#

Provides an API for exporting quantized PyTorch deep learning models. This class converts the quantized model to json-pth, json-safetensors files or onnx graph, and saves to export_dir.

Parameters:

config (ExporterConfig) – Configuration object containing settings for exporting.
export_dir (Union[Path, str]) – The target export directory.

export_quark_model(model: Module, quant_config: Config, custom_mode: str = 'quark') → None[source]#

Exports the quantized PyTorch model to quark file format using json and pth files.

The model’s network architecture or configuration is stored in the json file, and parameters including weight, bias, scale, and zero_point are stored in the pth file.

Parameters:

model (torch.nn.Module) – The quantized model to be exported.
quant_config (Config) – Configuration object containing settings for quantization. Default is None.
custom_mode (str) –
Whether to export the quantization config and model in a custom format expected by some downstream library. Possible options:
- "quark": standard quark format. This is the default and recommended format that should be favored.
- "awq": targets AutoAWQ library.
- "fp8": targets vLLM-compatible fp8 models.

Returns:

None

Example:

# default exporting:
export_path = "./output_dir"
from quark.torch import ModelExporter
from quark.torch.export.config.config import ExporterConfig, JsonExporterConfig, OnnxExporterConfig
NO_MERGE_REALQ_CONFIG = JsonExporterConfig(weight_format="real_quantized",
                                           pack_method="reorder")
export_config = ExporterConfig(json_export_config=NO_MERGE_REALQ_CONFIG, onnx_export_config=OnnxExporterConfig())
exporter = ModelExporter(config=export_config, export_dir=export_path)
quant_config = get_config(args, model_type)
exporter.export_quark_model(model, quant_config=quant_config, custom_mode=args.custom_mode)

Note:: Currently, default exporting quark format (json + pth).

get_export_model(model: Module, quant_config: Config, custom_mode: str = 'quark', add_export_info_for_hf: bool = True) → Module[source]#

Merges scales, replaces modules of the quantized model to prepare for export, and add export information in config.json.

Scale merging selects the maximum scale value in specified weight_group as the scale for each module in the group.

Build kv_scale selects the maximum kv_scale value in kv_group as the scale for the key projection output quantization and value projection output quantization.

Module replacement converts the model’s module (e.g. QuantLinear) according to the weight_format (to QparamsLinear).

Parameters:

model (torch.nn.Module) – The quantized model to be exported.
quant_config (Config) – Model quantization configuration.
custom_mode (str) –
Whether to export the quantization config and model in a custom format expected by some downstream library. Possible options:
- "quark": standard quark format. This is the default and recommended format that should be favored.
- "awq": targets AutoAWQ library.
- "fp8": targets vLLM-compatible fp8 models.
add_export_info_for_hf (bool) – Whether to add export info of quark to config.json when using hf_format_export. When loading the model, we recover the kv_cache in autofp8 format through the weight file, but we need the name of kv_layer, it is very cumbersome to get it from quark’s map, it is more reasonable to get it from config. If we find kv_scale in weight_flie and there is no special kv_layer_name, we will use k_proj,v_proj to recover kv_cache by default.

reset_model(model: Module) → None[source]#: Restore exported model to freezed Model for inferring, restore config content.

export_onnx_model(model: ~torch.nn.modules.module.Module, input_args: ~torch.Tensor | ~typing.Tuple[float], input_names: ~typing.List[str] = [], output_names: ~typing.List[str] = [], verbose: bool = False, opset_version: int | None = None, do_constant_folding: bool = True, operator_export_type: ~torch._C._onnx.OperatorExportTypes = <OperatorExportTypes.ONNX: 0>, uint4_int4_flag: bool = False) → None[source]#

This function aims to export onnx graph of the quantized PyTorch model.

Parameters:

model (torch.nn.Module) – The quantized model to be exported.
input_args (Union[torch.Tensor, Tuple[float]]) – Example inputs for this quantized model.
input_names (List[str]) – Names to assign to the input nodes of the onnx graph, in order. Defaults to [].
output_names (List[str]) – Names to assign to the output nodes of the onnx graph, in order. Defaults to [].
verbose (bool) – Flag to control showing verbose log or no. Default is False.
opset_version (Optional[int]) – The version of the default (ai.onnx) opset to target. If not set, it will be valued the latest version that is stable for the current version of PyTorch. Defaults to None.
operator_export_type (torch.onnx.OperatorExportTypes) – Export operator type in onnx graph. The choices include OperatorExportTypes.ONNX, OperatorExportTypes.ONNX_FALLTHROUGH, OperatorExportTypes.ONNX_ATEN and OperatorExportTypes.ONNX_ATEN_FALLBACK. Default is OperatorExportTypes.ONNX.
uint4_int4_flag (bool) – Flag to indicate uint4/int4 quantized model or not. Default is False.

Returns:

None

Example:

from quark.torch import ModelExporter
from quark.torch.export.config.config import ExporterConfig, JsonExporterConfig

export_config = ExporterConfig(json_export_config=JsonExporterConfig())
exporter = ModelExporter(config=export_config, export_dir=export_path)
exporter.export_onnx_model(model, input_args)

Note:: Mix quantization of int4/uint4 and int8/uint8 is not supported currently. In other words, if the model contains both quantized nodes of uint4/int4 and uint8/int8, this function cannot be used to export the ONNX graph.

export_gguf_model(model: Module, tokenizer_path: str | Path, model_type: str) → None[source]#

This function aims to export gguf file of the quantized PyTorch model.

Parameters:

model (torch.nn.Module) – The quantized model to be exported.
model_type (str) – Tokenizer needs to be encoded into gguf model. This argument specifies the directory path of the tokenizer, which contains tokenizer.json, tokenizer_config.json and/or tokenizer.model.
model_type – The model type of the model, e.g. "gpt2", "gptj", or "llama".

Returns:

None

Example:

from quark.torch import ModelExporter
from quark.torch.export.config.config import ExporterConfig, JsonExporterConfig
export_config = ExporterConfig(json_export_config=JsonExporterConfig())
exporter = ModelExporter(config=export_config, export_dir=export_path)
exporter.export_gguf_model(model, tokenizer_path, model_type)

Note:: Currently, only support asymetric int4 per_group weight-only quantization, and the group_size must be 32. Supported models include Llama2-7b, Llama2-13b, Llama2-70b, and Llama3-8b.

export_safetensors_model(model: Module, quant_config: Config, custom_mode: str = 'quark', **kwargs: Any) → None[source]#

Exports the quantized PyTorch model to the safetensors format.

Parameters:

model (torch.nn.Module) – The quantized model to be exported.
quant_config (Config) – Configuration object containing settings for quantization. Default is None.
custom_mode (str) –
Whether to export the quantization config and model in a custom format expected by some downstream library. Possible options:
- "quark": standard quark format. This is the default and recommended format that should be favored.
- "awq": targets AutoAWQ library.
- "fp8": targets vLLM-compatible fp8 models.

quark.torch.export.api.save_params(model: Module, model_type: str, args: Tuple[Any, ...] | None = None, kwargs: Dict[str, Any] | None = None, export_dir: Path | str = '/tmp', quant_mode: QuantizationMode = QuantizationMode.eager_mode, compressed: bool = False, reorder: bool = True) → None[source]#

Save the network architecture or configurations and parameters of the quantized model. For eager mode quantization, the model’s configurations are stored in json file, and parameters including weight, bias, scale, and zero_point are stored in safetensors file. For fx_graph mode quantization, the model’s network architecture and parameters are stored in pth file.

Parameters:

model (torch.nn.Module) – The quantized model to be saved.
model_type (str) – The type of the model, e.g. gpt2, gptj, llama or gptnext.
args (Optional[Tuple[Any, ...]]) – Example tuple inputs for this quantized model. Only available for fx_graph mode quantization. Default is None.
kwargs (Optional[Dict[str, Any]]) – Example dict inputs for this quantized model. Only available for fx_graph mode quantization. Default is None.
export_dir (Union[Path, str]) – The target export directory.
quant_mode (QuantizationMode) – The quantization mode. The choice includes QuantizationMode.eager_mode and QuantizationMode.fx_graph_mode. Default is QuantizationMode.eager_mode.
compressed (bool) – Export the compressed (real quantized) model or QDQ model, Default is False and it exports the QDQ model.
reorder (bool) – pack method, uses pack the weight (eg. packs four torch.int8 value into one torch.int32 value). Default is True.

Returns:

None

Examples:

# eager mode:
from quark.torch import save_params
save_params(model, model_type=model_type, export_dir="./save_dir")

# fx_graph mode:
from quark.torch.export.api import save_params
save_params(model,
            model_type=model_type,
            args=example_inputs,
            export_dir="./save_dir",
            quant_mode=QuantizationMode.fx_graph_mode)

class quark.torch.export.api.ModelImporter(model_info_dir: str, saved_format: str = 'quark_format', multi_device: bool = False)[source]#

Provides an API for importing quantized PyTorch deep learning models. This class load json-pth or json-safetensors files to model.

Parameters:

model_info_dir (str) – The target import directory.
saved_format (str) – Specifies the format to load from. This can be "quark_format" or "hf_format" (or "safetensors"). Defaults to "quark_format". multi_device (bool): Whether or not to use gpu + cpu mode to import models via “accelerate”.

import_model_info(model: Module) → Module[source]#

Reloads a serialized quantized model, based on the non-quantized module.

This function aims to import quark(json-pth) files of the Hugging Face large language model.

It could recover the weight, bias, scale, and zeropoint information of the model and execute the inference.

Parameters:: model (torch.nn.Module) – The original Hugging Face large language model.
Returns:: Model with quantized weights and modules.
Return type:: torch.nn.Module

Example:

from quark.torch import ModelImporter

model_importer = ModelImporter(model_info_dir="./import_model_dir")
model = importer.import_model_info(model)

PyTorch model export and reloading

Contents

PyTorch model export and reloading#