Getting Started with Quark for PyTorch

If users want to quantize the opt-125m model with the int8 data type for symmetric per-tensor weight-only quantization, they can write their Python script as shown below. Detailed explanations of each step are provided in other chapters of the User Guide.

# 1. Set model
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")
model.eval()
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

# 2. Set quantization configuration
from quark.torch.quantization.config.type import Dtype, ScaleType, RoundType, QSchemeType
from quark.torch.quantization.config.config import Config, QuantizationSpec, QuantizationConfig
from quark.torch.quantization.observer.observer import PerTensorMinMaxObserver
DEFAULT_INT8_PER_TENSOR_SYM_SPEC = QuantizationSpec(dtype=Dtype.int8,
                                        qscheme=QSchemeType.per_tensor,
                                        observer_cls=PerTensorMinMaxObserver,
                                        symmetric=True,
                                        scale_type=ScaleType.float,
                                        round_method=RoundType.half_even,
                                        is_dynamic=False)

DEFAULT_W_INT8_PER_TENSOR_CONFIG = QuantizationConfig(weight=DEFAULT_INT8_PER_TENSOR_SYM_SPEC)
quant_config = Config(global_quant_config=DEFAULT_W_INT8_PER_TENSOR_CONFIG)
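
# For intuition, symmetric per-tensor int8 quantization of a weight tensor uses one
# scale for the whole tensor and no zero point. A rough sketch of the math follows
# (not Quark's internal implementation; the layer path below is only illustrative):
# import torch
# w = model.model.decoder.layers[0].self_attn.q_proj.weight.data
# scale = w.abs().max() / 127.0                         # min-max-style per-tensor scale
# w_q = torch.clamp(torch.round(w / scale), -127, 127)  # torch.round is half-to-even, like RoundType.half_even
# w_dq = w_q * scale                                     # dequantized weights used at compute time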

# 3. Define calibration dataloader (this step is still required for weight-only and dynamic quantization)
from torch.utils.data import DataLoader
text = "Hello, how are you?"
tokenized_outputs = tokenizer(text, return_tensors="pt")
calib_dataloader = DataLoader(tokenized_outputs['input_ids'])
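
# (Optional) A single prompt is enough for this weight-only example, but for static
# activation quantization you would typically calibrate on more data. A minimal
# sketch with a few prompts (the texts and batch handling here are illustrative):
# calib_texts = ["Hello, how are you?", "The weather is nice today."]
# calib_samples = [tokenizer(t, return_tensors="pt")['input_ids'] for t in calib_texts]
# calib_dataloader = DataLoader(calib_samples, batch_size=None)  # yield each (1, seq_len) tensor as-is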

# 4. In-place replacement of the model's modules with their quantized counterparts
from quark.torch import ModelQuantizer
quantizer = ModelQuantizer(quant_config)
quant_model = quantizer.quantize_model(model, calib_dataloader)
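
# The returned model is still a regular torch.nn.Module, so standard PyTorch
# inspection works if you want to see which submodules were replaced, e.g.:
# print(quant_model.model.decoder.layers[0])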

# # 5. (Optional) Export to ONNX
# # If users want to export the quantized model, they should freeze it first
# freezed_quantized_model = quantizer.freeze(quant_model)
# from quark.torch import ModelExporter
# # Get dummy input
# for data in calib_dataloader:
#     input_args = data
#     break
# quant_model = quant_model.to('cuda')
# input_args = input_args.to('cuda')
# exporter = ModelExporter('export_path')
# exporter.export_onnx_model(quant_model, input_args)

If the code runs successfully, the terminal will display: [QUARK-INFO]: Model quantization has been completed.
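
Since the quantized model is still a Hugging Face PyTorch model, a quick sanity-check generation can be run with it. Below is a minimal sketch, assuming the model remains on CPU and still exposes the usual generate API; the prompt and decoding settings are illustrative, not part of the example above.

# (Optional) Quick sanity-check generation with the quantized model
import torch
prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = quant_model.generate(inputs['input_ids'], max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))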