Release Note

New Features (Version 0.2.0)

  • Quark for PyTorch

  • Quark for ONNX

    • ONNX Quantizer Enhancements:

      • Supported multiple optimization and refinement strategies for different deployment backends.

      • Supported automatic mixed precision to balance accuracy and performance.

    • Quantization Strategy:

      • Supported symmetric and asymmetric quantization.

      • Supported float, INT16, and power-of-two scales.

      • Supported static quantization and weight-only quantization.
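
As a conceptual illustration of these strategies (plain NumPy, not Quark's API; all names here are illustrative): symmetric quantization fixes the zero point at 0, asymmetric quantization maps the full [min, max] range onto the integer grid, and a power-of-two scale rounds the float scale to the nearest 2**k.

```python
import numpy as np

def symmetric_int8_params(x: np.ndarray):
    # Symmetric: zero point is fixed at 0; the scale covers the max magnitude.
    scale = np.abs(x).max() / 127.0
    return scale, 0

def asymmetric_uint8_params(x: np.ndarray):
    # Asymmetric: scale and zero point map [min, max] onto [0, 255].
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0
    zero_point = int(np.clip(round(-lo / scale), 0, 255))
    return scale, zero_point

def power_of_two_scale(scale: float) -> float:
    # Power-of-two: round the float scale to the nearest 2**k, which lets
    # fixed-point hardware dequantize with bit shifts instead of multiplies.
    return 2.0 ** round(np.log2(scale))

x = np.random.randn(1000).astype(np.float32)
print(symmetric_int8_params(x))
print(asymmetric_uint8_params(x))
print(power_of_two_scale(symmetric_int8_params(x)[0]))
```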

    • Quantization Granularity:

      • Support for per-tensor and per-channel granularity.
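
The difference between the two granularities, again as a hypothetical NumPy sketch: per-tensor uses one scale for the whole tensor, while per-channel computes one scale per output channel.

```python
import numpy as np

w = np.random.randn(64, 128).astype(np.float32)  # (out_channels, in_features)

# Per-tensor: a single scale for the whole weight tensor.
scale_per_tensor = np.abs(w).max() / 127.0

# Per-channel: one scale per output channel (axis 0); each channel's range
# is tracked more tightly, which usually improves accuracy.
scale_per_channel = np.abs(w).max(axis=1) / 127.0  # shape (64,)

print(scale_per_tensor, scale_per_channel[:4])
```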

    • Data Types:

      • Multiple data types are supported, including INT32/UINT32, Float16, Bfloat16, INT16/UINT16, INT8/UINT8, and BFP (block floating point).
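
BFP groups values into fixed-size blocks that share one exponent while each value keeps its own integer mantissa. A rough sketch of the idea, with a hypothetical block size and mantissa width (not the exact behavior of Quark's BFPFixNeuron):

```python
import numpy as np

def bfp_quantize(x: np.ndarray, block: int = 8, mantissa_bits: int = 8) -> np.ndarray:
    # Each block of `block` values shares one exponent; individual values
    # keep signed integer mantissas of `mantissa_bits` bits.
    blocks = x.reshape(-1, block)
    shared_exp = np.ceil(np.log2(np.abs(blocks).max(axis=1, keepdims=True) + 1e-30))
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    q_max = 2 ** (mantissa_bits - 1) - 1
    q = np.clip(np.round(blocks / scale), -q_max - 1, q_max)
    return (q * scale).reshape(x.shape)

x = np.random.randn(64).astype(np.float32)
print(np.abs(bfp_quantize(x) - x).max())  # small block-wise rounding error
```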

    • Calibration Methods:

      • MinMax, Entropy, and Percentile for float scale.

      • MinMax for INT16 scale.

      • NonOverflow and MinMSE for power-of-two scale.
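
As an illustration of how such calibrators derive a quantization range from observed data (conceptual, not Quark's implementation): MinMax tracks running extremes across calibration batches, while Percentile clips outliers; MinMSE-style methods instead search for the range that minimizes reconstruction error.

```python
import numpy as np

def minmax_range(batches):
    # MinMax: track the running min/max over all calibration batches.
    lo, hi = np.inf, -np.inf
    for b in batches:
        lo, hi = min(lo, float(b.min())), max(hi, float(b.max()))
    return lo, hi

def percentile_range(batches, q=99.99):
    # Percentile: clip outliers by taking the q-th percentile of |x|.
    data = np.concatenate([b.ravel() for b in batches])
    m = float(np.percentile(np.abs(data), q))
    return -m, m

batches = [np.random.randn(4096).astype(np.float32) for _ in range(8)]
print(minmax_range(batches))
print(percentile_range(batches))
```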

    • Custom Operations:

      • “BFPFixNeuron” which supports block floating-point data type.

      • “VitisQuantizeLinear” and “VitisDequantizeLinear” which support INT32/UINT32, Float16, Bfloat16, and INT16/UINT16 quantization.

      • “VitisInstanceNormalization” and “VitisLSTM” which have customized Bfloat16 kernels.

      • All custom operations support execution on CPU only.
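
For orientation, this is roughly what such a custom node looks like in an ONNX graph, built with the standard onnx helper API. The domain string below is a placeholder, not necessarily the one Quark registers, and the resulting model must be run with the matching custom-op library loaded (on CPU, per the note above).

```python
from onnx import helper

# "com.example.quark" is a hypothetical domain; Quark registers its
# custom ops under its own operator domain.
node = helper.make_node(
    "VitisQuantizeLinear",
    inputs=["x", "x_scale", "x_zero_point"],
    outputs=["x_quantized"],
    domain="com.example.quark",
)
print(node)
```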

    • Advanced Quantization Algorithms:

      • Supported CLE (Cross-Layer Equalization), BiasCorrection, AdaQuant, AdaRound, and SmoothQuant.
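
Of these, CLE is the simplest to show in isolation. A self-contained sketch of cross-layer equalization for two consecutive linear layers with a ReLU in between (the textbook formulation, not Quark's code): rescaling a channel down in one layer and up in the next leaves the function unchanged because ReLU is positively homogeneous.

```python
import numpy as np

# Two consecutive linear layers with a ReLU in between:
#   y = W2 @ relu(W1 @ x)
W1 = np.random.randn(16, 8)
W2 = np.random.randn(4, 16)

r1 = np.abs(W1).max(axis=1)   # per-output-channel range of W1
r2 = np.abs(W2).max(axis=0)   # per-input-channel range of W2
s = np.sqrt(r1 / r2)          # equalizing scale for each shared channel

# Scale a channel down in W1 and up in W2; ReLU is positively homogeneous
# (relu(a * v) = a * relu(v) for a > 0), so the output is unchanged.
W1_eq = W1 / s[:, None]
W2_eq = W2 * s[None, :]

x = np.random.randn(8)
assert np.allclose(W2 @ np.maximum(W1 @ x, 0),
                   W2_eq @ np.maximum(W1_eq @ x, 0))
```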

    • Operating System Support:

      • Linux and Windows.

New Features (Version 0.1.0)

  • Quark for PyTorch

    • PyTorch Quantizer Enhancements:

      • Eager mode is supported.

      • Post Training Quantization (PTQ) is now available.

      • Automatic in-place replacement of nn.Module operations (sketched after this list).

      • Quantization of the following modules is supported: torch.nn.Linear.

      • The customizable calibration process is introduced.
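
A minimal sketch of what in-place replacement means, assuming a hypothetical FakeQuantLinear wrapper (this is not Quark's ModelQuantizer API): every nn.Linear child is swapped for a wrapper that fake-quantizes its weight at forward time.

```python
import torch
import torch.nn as nn

class FakeQuantLinear(nn.Module):
    # Hypothetical wrapper: fake-quantizes the wrapped layer's weight to
    # int8 (quantize -> dequantize) on every forward pass.
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear

    def forward(self, x):
        w = self.linear.weight
        scale = w.abs().max() / 127.0
        w_q = torch.clamp(torch.round(w / scale), -127, 127) * scale
        return nn.functional.linear(x, w_q, self.linear.bias)

def replace_linear_inplace(model: nn.Module):
    # Recursively swap every nn.Linear child for the wrapper, in place.
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, FakeQuantLinear(child))
        else:
            replace_linear_inplace(child)

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
replace_linear_inplace(model)
print(model)
```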

    • Quantization Strategy:

      • Symmetric and asymmetric quantization are supported.

      • Weight-only, dynamic, and static quantization modes are available.

    • Quantization Granularity:

      • Support for per-tensor, per-channel, and per-group granularity.

    • Data Types:

      • Multiple data types are supported, including float16, bfloat16, int4, uint4, int8, and fp8 (e4m3fn).
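
These dtypes can be exercised directly in PyTorch; for example, fp8 e4m3fn round trips work out of the box in recent releases (torch.float8_e4m3fn requires PyTorch 2.1 or later):

```python
import torch

x = torch.randn(4)

# Quantize-dequantize round trips; each result is a lower-precision
# approximation of x.
print(x.to(torch.bfloat16).to(torch.float32))
print(x.to(torch.float8_e4m3fn).to(torch.float32))
```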

    • Calibration Methods:

      • MinMax, Percentile, and MSE calibration methods are now supported.

    • Large Language Model Support:

      • FP8 KV-cache quantization for large language models (LLMs).

    • Advanced Quantization Algorithms:

      • Supported SmoothQuant, AWQ (uint4), and GPTQ (uint4) for LLMs. (Note: the AWQ, GPTQ, and SmoothQuant algorithms are currently limited to single-GPU usage.)
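
SmoothQuant's core identity fits in a few lines (conceptual NumPy, not Quark's implementation): per input channel, activations are divided by a smoothing factor and the matching weight column is multiplied by it, leaving the layer output unchanged while shrinking activation outliers.

```python
import numpy as np

alpha = 0.5  # balances how much difficulty moves from activations to weights
X = np.random.randn(32, 64)
X[:, ::8] *= 10.0                 # simulate a few outlier channels
W = np.random.randn(16, 64)       # layer computes Y = X @ W.T

# Per input channel j: divide X by s_j and multiply W's column j by s_j.
s = np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=0) ** (1 - alpha)
X_s, W_s = X / s, W * s

assert np.allclose(X @ W.T, X_s @ W_s.T)   # output unchanged
print(np.abs(X).max(), np.abs(X_s).max())  # activation outliers shrink
```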

    • Export Capabilities:

      • Export of Q/DQ quantized models to ONNX and the vLLM-adopted JSON-safetensors format is now supported.
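
For context, the JSON-safetensors layout pairs a .safetensors file of tensors with a JSON sidecar of quantization metadata. A rough sketch with hypothetical keys (not Quark's exact export schema), using the standard safetensors API:

```python
import json
import torch
from safetensors.torch import save_file

# Tensors go into a .safetensors file; quantization metadata goes into a
# JSON sidecar. All keys below are hypothetical, not Quark's schema.
tensors = {
    "model.layers.0.weight": torch.randn(16, 8),
    "model.layers.0.weight_scale": torch.tensor(0.02),
}
save_file(tensors, "model.safetensors")

with open("config.json", "w") as f:
    json.dump({"quant_method": "int8", "quantized_layers": 1}, f)
```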

    • Operating System Support:

      • Linux (ROCm and CUDA supported).

      • Windows (CPU only).