Supported Data and Op Types

Supported Data Types#

Summary Table#

Supported Data Types
Int8 / UInt8
Int16 / UInt16
Int32 / UInt32
Float16
BFloat16
BFP16
MX4 / MX6 / MX9
MXFP8(E5M2) / MXFP8(E4M3) / MXFP6(E3M2) / MXFP6(E2M3) / MXFP4(E2M1) / MXINT8

As the table shows, many of these are non-integer data types that the official ONNX Runtime operators do not support. To support them, we developed several custom operators using the ONNX Runtime custom operator C API. The operators and their specifications are:

ExtendedQuantizeLinear - specification

ExtendedDequantizeLinear - specification

BFPQuantizeDequantize - specification

MXQuantizeDequantize - specification

Note

Installing on Windows requires Visual Studio 2022 or later. There are two ways to make its tools available during compilation:

Use the Developer Command Prompt for Visual Studio: When installing Visual Studio, ensure that the Developer Command Prompt for Visual Studio is installed as well, then run your commands in that prompt's CMD window.
Manually add paths to environment variables: The build uses Visual Studio's cl.exe, MSBuild.exe, and link.exe, which are located in the Visual Studio installation directory. Ensure that the folders containing them are on the PATH environment variable: in the Edit Environment Variables window, click New, paste the path to each folder containing cl.exe, link.exe, or MSBuild.exe, and click OK on all windows to apply the changes.
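
To confirm that the manual PATH setup worked, a quick check like the one below can help. It is a minimal sketch that only uses Python's standard `shutil.which`; the helper name `missing_build_tools` is hypothetical and is not part of Quark.

```python
import shutil

# Tool names taken from the note above.
REQUIRED_TOOLS = ("cl.exe", "link.exe", "MSBuild.exe")


def missing_build_tools(tools=REQUIRED_TOOLS, path=None):
    """Return the tools that cannot be found on PATH (or on the given path string)."""
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    missing = missing_build_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required build tools are on PATH.")
```

If any tool is reported missing, reopen the terminal after editing PATH, or switch to the Developer Command Prompt.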

1. Quantizing to Other Precision Levels#

In addition to INT8/UINT8, quark.onnx supports quantizing models to other data formats, including INT16/UINT16, INT32/UINT32, Float16, and BFloat16, which can provide better accuracy or be used for experimental purposes. The code below is an example for Int16. The table below lists the available data type specs.

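from quark.onnx import ModelQuantizer, QConfig, QLayerConfig, Int16Spec

# Quantize input tensors and weights to Int16.
config = QConfig(global_config=QLayerConfig(input_tensors=Int16Spec(), weight=Int16Spec()))
quantizer = ModelQuantizer(config)
# model_input / model_output: the source model and the destination for the quantized model.
# calibration_data_reader: supplies calibration samples to the quantizer.
quantizer.quantize_model(model_input, model_output, calibration_data_reader)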
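
The `calibration_data_reader` passed to `quantize_model` above is not defined in the snippet. The sketch below shows one possible shape for it, assuming the quantizer follows ONNX Runtime's `CalibrationDataReader` convention: `get_next()` returns a dict mapping input names to one batch, and `None` once the data is exhausted. The class name `IterableCalibrationDataReader` and the input name `"input"` are hypothetical; in practice you would likely subclass `onnxruntime.quantization.CalibrationDataReader` and feed NumPy arrays.

```python
class IterableCalibrationDataReader:
    """Duck-typed reader: each get_next() call returns one feed dict, then None."""

    def __init__(self, input_name, batches):
        # input_name must match the model's input tensor name (assumed here).
        self.input_name = input_name
        self._iterator = iter(batches)

    def get_next(self):
        batch = next(self._iterator, None)
        if batch is None:
            return None  # signals that calibration data is exhausted
        return {self.input_name: batch}


# Plain lists stand in for real calibration batches in this sketch.
calibration_data_reader = IterableCalibrationDataReader("input", [[0.1, 0.2], [0.3, 0.4]])
```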

Data Type	Spec
Int8	Int8Spec
UInt8	UInt8Spec
Int16	Int16Spec
UInt16	UInt16Spec
Int32	Int32Spec
UInt32	UInt32Spec
BFloat16	BFloat16Spec
BFP16	BFP16Spec
MX4	MX4Spec
MX6	MX6Spec
MX9	MX9Spec
MXFP4E2M1	MXFP4E2M1Spec
MXFP6E3M2	MXFP6E3M2Spec
MXFP6E2M3	MXFP6E2M3Spec
MXFP8E5M2	MXFP8E5M2Spec
MXFP8E4M3	MXFP8E4M3Spec
MXInt8	MXInt8Spec
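
The naming pattern in the table (data type name plus `Spec`) makes it easy to select a spec from a string, for example when sweeping over precisions. The sketch below is illustrative only: the helpers `spec_class_name` and `build_config` are hypothetical, and it assumes every spec in the table can be imported from `quark.onnx` and used for both `input_tensors` and `weight`, as `Int16Spec` is in the example above. Data types that the table does not list are rejected rather than guessed.

```python
# Data types listed in the table above; each maps to "<name>Spec".
TABLE_DATA_TYPES = (
    "Int8", "UInt8", "Int16", "UInt16", "Int32", "UInt32",
    "BFloat16", "BFP16", "MX4", "MX6", "MX9",
    "MXFP4E2M1", "MXFP6E3M2", "MXFP6E2M3", "MXFP8E5M2", "MXFP8E4M3", "MXInt8",
)


def spec_class_name(data_type):
    """Return the spec class name for a data type listed in the table."""
    if data_type not in TABLE_DATA_TYPES:
        raise ValueError(f"No spec listed for data type {data_type!r}")
    return data_type + "Spec"


def build_config(data_type):
    """Build a QConfig that uses the same spec for input tensors and weights."""
    # Imported lazily so the name lookup above works without Quark installed.
    import quark.onnx as quark_onnx

    spec_cls = getattr(quark_onnx, spec_class_name(data_type))
    layer_config = quark_onnx.QLayerConfig(input_tensors=spec_cls(), weight=spec_cls())
    return quark_onnx.QConfig(global_config=layer_config)
```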

Note

BFP16 and MX data types use custom ops. To run inference with ONNX Runtime, register the custom op library (a .so file on Linux or a .dll file on Windows) in the ORT session options.

import onnxruntime
from quark.onnx import get_library_path

device = 'CPU'
providers = ['CPUExecutionProvider']

# Alternatively, use a GPU configuration:
# device='ROCM'
# providers = ['ROCMExecutionProvider']
# device='CUDA'
# providers = ['CUDAExecutionProvider']

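# Register the custom op library for the chosen device before creating the session.
sess_options = onnxruntime.SessionOptions()
sess_options.register_custom_ops_library(get_library_path(device))
# onnx_model_path: path to the quantized ONNX model.
session = onnxruntime.InferenceSession(onnx_model_path, sess_options, providers=providers)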
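
Rather than hard-coding `device` and `providers`, they can be chosen from the execution providers that the installed ONNX Runtime build reports. The sketch below is one way to do that; the helper `pick_device` and its GPU-first preference order are assumptions, while the device names and provider names are the ones used in the snippet above.

```python
# Device names and providers from the snippet above, in an assumed preference order.
DEVICE_PROVIDERS = (
    ("CUDA", "CUDAExecutionProvider"),
    ("ROCM", "ROCMExecutionProvider"),
    ("CPU", "CPUExecutionProvider"),
)


def pick_device(available_providers):
    """Return (device, providers) for the first preferred provider that is available."""
    for device, provider in DEVICE_PROVIDERS:
        if provider in available_providers:
            return device, [provider]
    raise RuntimeError("None of the expected execution providers are available")


# Typical use (requires onnxruntime):
#   import onnxruntime
#   device, providers = pick_device(onnxruntime.get_available_providers())
```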

2. Quantizing Float16 Models#

For Float16 models, we recommend setting ConvertFP16ToFP32 to True in extra_options. This converts the Float16 model to Float32 before quantization, which reduces redundant nodes such as Cast in the model.

from quark.onnx import ModelQuantizer, QConfig, QLayerConfig, Int8Spec

# Convert the Float16 model to Float32 first, then quantize input tensors and weights to Int8.
config = QConfig(global_config=QLayerConfig(input_tensors=Int8Spec(), weight=Int8Spec()),
                 extra_options={"ConvertFP16ToFP32": True})
quantizer = ModelQuantizer(config)
quantizer.quantize_model(model_input, model_output, calibration_data_reader)

Note

ConvertFP16ToFP32 requires onnxslim to simplify the ONNX model. Install it with python -m pip install onnxslim.
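
Because the dependency is only needed for this option, it can be useful to fail early with a clear message before starting quantization. The following is a minimal sketch using the standard library; the helper names `is_installed` and `fp16_extra_options` are hypothetical, and it assumes the package's import name is `onnxslim`, matching its pip name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def fp16_extra_options():
    """Return extra_options for Float16 models, checking for onnxslim first."""
    if not is_installed("onnxslim"):
        raise RuntimeError(
            "ConvertFP16ToFP32 needs onnxslim; install it with "
            "'python -m pip install onnxslim'"
        )
    return {"ConvertFP16ToFP32": True}
```

The returned dict can be passed as `extra_options` to `QConfig`, as in the example above.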

Supported Op Types#

Summary Table#

Note: For built-in configs except those with block floating-point data types, the extra option ForceQuantizeNoInputCheck is set to True by default. In that case, ops listed below as “quantized only when input is quantized” are always quantized (their inputs are quantized and quantized outputs are produced). For custom configs with ForceQuantizeNoInputCheck=False, those ops follow the input-dependent behavior.

Table: List of Quark ONNX Supported Quantized Ops

Supported Ops	Comments
Add
ArgMax
AveragePool	Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True).
BatchNormalization	By default, the “optimize_model” parameter will fuse BatchNormalization to Conv/ConvTranspose/Gemm. For standalone BatchNormalization, quantization is supported only for NPU_CNN platforms by converting BatchNormalization to Conv.
Clip	Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True).
Concat
Conv
ConvTranspose
DepthToSpace	Quantization is supported only for NPU_CNN platforms.
Div	Quantization is supported only for NPU_CNN platforms.
Erf	Quantization is supported only for NPU_CNN platforms.
Gather
Gemm
GlobalAveragePool
HardSigmoid	Quantization is supported only for NPU_CNN platforms.
InstanceNormalization
LayerNormalization	Supported for opset>=17. Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True).
LeakyRelu
LpNormalization	Quantization is supported only for NPU_CNN platforms.
MatMul
Min	Quantization is supported only for NPU_CNN platforms.
Max	Quantization is supported only for NPU_CNN platforms.
MaxPool	Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True).
Mul
Pad
PRelu	Quantization is supported only for NPU_CNN platforms.
ReduceMean	Quantization is supported only for NPU_CNN platforms.
Relu	Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True).
Reshape	Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True).
Resize
Slice	Quantization is supported only for NPU_CNN platforms.
Sigmoid
Softmax
SpaceToDepth	Quantization is supported only for NPU_CNN platforms.
Split
Squeeze	Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True).
Sub	Quantization is supported only for NPU_CNN platforms.
Tanh	Quantization is supported only for NPU_CNN platforms.
Transpose	Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True).
Unsqueeze	Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True).
Where
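
The input-dependent rule described in the note and the table can be summarized as a small decision function. This is only an illustrative model of the documented behavior, not Quark's implementation; the function name `input_dependent_op_is_quantized` is hypothetical, and the op set is copied from the table rows marked "Quantized only when its input is quantized".

```python
# Ops the table marks as "Quantized only when its input is quantized".
INPUT_DEPENDENT_OPS = frozenset({
    "AveragePool", "Clip", "LayerNormalization", "MaxPool", "Relu",
    "Reshape", "Squeeze", "Transpose", "Unsqueeze",
})


def input_dependent_op_is_quantized(op_type, input_is_quantized,
                                    force_quantize_no_input_check=True):
    """Model whether an input-dependent op gets quantized under the documented rule."""
    if op_type not in INPUT_DEPENDENT_OPS:
        raise ValueError(f"{op_type!r} is not listed as an input-dependent op")
    # With ForceQuantizeNoInputCheck=True the op is always quantized;
    # otherwise it is quantized only when its input already is.
    return force_quantize_no_input_check or input_is_quantized


# A custom config would opt into the input-dependent behavior like this:
extra_options = {"ForceQuantizeNoInputCheck": False}
```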
