Supported Data and Op Types
Supported Data Types
Summary Table
| Supported Data Types |
|---|
| Int8 / UInt8 |
| Int16 / UInt16 |
| Int32 / UInt32 |
| Float16 |
| BFloat16 |
| BFP16 |
| MX4 / MX6 / MX9 |
| MXFP8(E5M2) / MXFP8(E4M3) / MXFP6(E3M2) / MXFP6(E2M3) / MXFP4(E2M1) / MXINT8 |
As the table shows, many of these are non-integer data types that the official ONNX Runtime operators do not support. To support them, we have developed several custom operators using ONNX Runtime's custom operator C APIs. The custom ops and their specifications are:
ExtendedQuantizeLinear - specification
ExtendedDequantizeLinear - specification
BFPQuantizeDequantize - specification
MXQuantizeDequantize - specification
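To give a feel for what a block floating-point quantize-dequantize operator computes, here is a minimal pure-Python sketch. It assumes that all values in a block share one power-of-two exponent and keep a signed 8-bit mantissa each. The function name, mantissa width, and rounding mode are illustrative assumptions, not the specification of BFPQuantizeDequantize.

```python
import math


def bfp_quantize_dequantize(block, mantissa_bits=8):
    """Round a block of floats to a shared-exponent grid and dequantize.

    Illustrative only: the real custom op may use a different block size,
    mantissa width, and rounding mode.
    """
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return [0.0 for _ in block]

    # frexp gives max_abs = m * 2**e with 0.5 <= m < 1, so 2**e bounds the block.
    shared_exp = math.frexp(max_abs)[1]

    # Signed mantissa range, e.g. [-128, 127] for 8 bits.
    qmax = 2 ** (mantissa_bits - 1) - 1
    qmin = -qmax - 1

    # One step of the shared grid.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))

    out = []
    for v in block:
        q = max(qmin, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out


# Values that sit on the shared grid survive unchanged.
print(bfp_quantize_dequantize([1.0, 0.5, -0.25, 0.0]))
# A value much smaller than the block maximum is flushed to zero.
print(bfp_quantize_dequantize([1.0, 0.001]))
```

The second call shows the main trade-off of a shared exponent: small values in a block lose precision when the block also contains a large value.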
Note
Installing on Windows requires Visual Studio 2022 or later. There are two ways to make its tools available during compilation:
Use the Developer Command Prompt for Visual Studio: When installing Visual Studio, ensure that the Developer Command Prompt for Visual Studio is installed as well. Run your commands in the CMD window of the Developer Command Prompt for Visual Studio.
Manually Add Paths to Environment Variables: Visual Studio's cl.exe, MSBuild.exe, and link.exe will be used, so ensure that their paths are added to the PATH environment variable. These programs are located in the Visual Studio installation directory. In the Edit Environment Variables window, click New, then paste the path to each folder containing cl.exe, link.exe, and MSBuild.exe. Click OK on all windows to apply the changes.
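Before starting a build, it can help to confirm that the three tools are actually reachable. The following is a small sketch that uses only the Python standard library; `missing_build_tools` is a hypothetical helper written for this note and is not part of Quark.

```python
import shutil

# Tools named in the note above.
REQUIRED_TOOLS = ("cl.exe", "link.exe", "MSBuild.exe")


def missing_build_tools(which=shutil.which):
    """Return the required Visual Studio tools that cannot be found on PATH.

    `which` is injectable so the check can be exercised without Visual Studio.
    """
    return [tool for tool in REQUIRED_TOOLS if which(tool) is None]


missing = missing_build_tools()
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All required build tools were found on PATH.")
```

Inside the Developer Command Prompt the list is expected to be empty, because that prompt sets up PATH for you.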
1. Quantizing to Other Precision Levels
In addition to INT8/UINT8, quark.onnx supports quantizing models to other data formats, including INT16/UINT16, INT32/UINT32, Float16, and BFloat16, which can provide better accuracy or serve experimental purposes. The code below is an example for Int16, and the table that follows lists all data type specs.
from quark.onnx import ModelQuantizer, QConfig, QLayerConfig, Int16Spec
config = QConfig(global_config=QLayerConfig(input_tensors=Int16Spec(), weight=Int16Spec()))
quantizer = ModelQuantizer(config)
quantizer.quantize_model(model_input, model_output, calibration_data_reader)
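As a rough illustration of why a wider integer type can improve accuracy, the sketch below compares the round-trip error of 8-bit and 16-bit quantization on the same values. It uses generic symmetric per-tensor quantization in pure Python; this is an assumption made for illustration and is not necessarily the scheme that quark.onnx applies.

```python
def quantize_dequantize(values, bits):
    """Symmetric per-tensor quantize-dequantize with a signed `bits`-wide integer."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def max_abs_error(values, bits):
    """Largest absolute difference between the values and their round trip."""
    restored = quantize_dequantize(values, bits)
    return max(abs(v - r) for v, r in zip(values, restored))


# Sample values spread over roughly [-1, 1].
values = [0.001 * i for i in range(-1000, 1001, 7)]
err8 = max_abs_error(values, 8)
err16 = max_abs_error(values, 16)
print(f"max error with 8 bits:  {err8:.6f}")
print(f"max error with 16 bits: {err16:.6f}")
```

The 16-bit grid has 256 times as many steps over the same range, so its worst-case rounding error is correspondingly smaller, at the cost of larger tensors.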
| Data Type | Spec |
|---|---|
| Int8 | Int8Spec |
| UInt8 | UInt8Spec |
| Int16 | Int16Spec |
| UInt16 | UInt16Spec |
| Int32 | Int32Spec |
| UInt32 | UInt32Spec |
| BFloat16 | BFloat16Spec |
| BFP16 | BFP16Spec |
| MX4 | MX4Spec |
| MX6 | MX6Spec |
| MX9 | MX9Spec |
| MXFP4E2M1 | MXFP4E2M1Spec |
| MXFP6E3M2 | MXFP6E3M2Spec |
| MXFP6E2M3 | MXFP6E2M3Spec |
| MXFP8E5M2 | MXFP8E5M2Spec |
| MXFP8E4M3 | MXFP8E4M3Spec |
| MXInt8 | MXInt8Spec |
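The MXFP names in the table encode the element layout: the digits after E give the exponent bits, the digits after M give the mantissa bits, and the one remaining bit is the sign. The sketch below decodes such a name; `mxfp_layout` is a hypothetical helper written for this page, not a quark.onnx function.

```python
import re

# Matches names such as "MXFP8E5M2": total bits, exponent bits, mantissa bits.
_MXFP_NAME = re.compile(r"^MXFP(\d+)E(\d+)M(\d+)$")


def mxfp_layout(name):
    """Split an MXFP data type name into its per-element bit layout."""
    match = _MXFP_NAME.match(name)
    if match is None:
        raise ValueError(f"not an MXFP data type name: {name!r}")
    total, exponent, mantissa = (int(g) for g in match.groups())
    if total - exponent - mantissa != 1:
        raise ValueError(f"bit counts in {name!r} do not leave exactly one sign bit")
    return {
        "total_bits": total,
        "sign_bits": 1,
        "exponent_bits": exponent,
        "mantissa_bits": mantissa,
    }


for name in ("MXFP4E2M1", "MXFP6E3M2", "MXFP6E2M3", "MXFP8E5M2", "MXFP8E4M3"):
    print(name, mxfp_layout(name))
```

For two formats of the same width, more exponent bits give a wider dynamic range and more mantissa bits give finer resolution, which is the difference between, for example, MXFP8E5M2 and MXFP8E4M3.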
Note
BFP16 and MX data types use custom ops. When running inference with ONNX Runtime, we need to register the custom op library (a .so file on Linux or a .dll file on Windows) in the ORT session options.
import onnxruntime
from quark.onnx import get_library_path
device = 'CPU'
providers = ['CPUExecutionProvider']
# Alternatively, use one of the GPU configurations:
# device='ROCM'
# providers = ['ROCMExecutionProvider']
# device='CUDA'
# providers = ['CUDAExecutionProvider']
sess_options = onnxruntime.SessionOptions()
sess_options.register_custom_ops_library(get_library_path(device))
session = onnxruntime.InferenceSession(onnx_model_path, sess_options, providers=providers)
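The snippet above pairs each device string with one execution provider. One way to keep that pairing in a single place is a small lookup such as the sketch below. `providers_for` is a hypothetical helper, and the mapping simply mirrors the three combinations shown above; it does not check which providers your ONNX Runtime build actually offers.

```python
# Device strings and provider names as used in the snippet above.
_PROVIDERS = {
    "CPU": ["CPUExecutionProvider"],
    "ROCM": ["ROCMExecutionProvider"],
    "CUDA": ["CUDAExecutionProvider"],
}


def providers_for(device):
    """Return the execution provider list for a device string (case-insensitive)."""
    try:
        # Return a copy so callers cannot mutate the shared table.
        return list(_PROVIDERS[device.upper()])
    except KeyError:
        supported = ", ".join(sorted(_PROVIDERS))
        raise ValueError(
            f"unsupported device {device!r}; expected one of: {supported}"
        ) from None


device = "CPU"
providers = providers_for(device)
print(device, providers)
```

The resulting `device` and `providers` values can then be passed to get_library_path and onnxruntime.InferenceSession exactly as in the snippet above.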
2. Quantizing Float16 Models
For Float16 models, we recommend setting ConvertFP16ToFP32 to True in extra_options. This converts the Float16 model to a Float32 model before quantization, which reduces redundant nodes such as Cast in the model.
from quark.onnx import ModelQuantizer, QConfig, QLayerConfig, Int8Spec
config = QConfig(global_config=QLayerConfig(input_tensors=Int8Spec(), weight=Int8Spec()),
                 extra_options={"ConvertFP16ToFP32": True})
quantizer = ModelQuantizer(config)
quantizer.quantize_model(model_input, model_output, calibration_data_reader)
Note
ConvertFP16ToFP32 in quark.onnx requires onnxslim to simplify the ONNX model. Ensure that onnxslim is installed by running python -m pip install onnxslim.
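Because the option depends on an extra package, a script may want to fail early with a clear message rather than partway through quantization. The sketch below builds the extra_options dictionary only after checking that onnxslim can be imported. `fp16_extra_options` and its parameters are hypothetical names used for illustration; only the ConvertFP16ToFP32 key and the onnxslim package name come from the text above.

```python
import importlib.util


def _is_installed(module_name):
    """True if the module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def fp16_extra_options(model_is_fp16, module_available=_is_installed):
    """Return extra_options for a model, enabling ConvertFP16ToFP32 for Float16 input.

    `module_available` is injectable so the check can be exercised in isolation.
    """
    if not model_is_fp16:
        return {}
    if not module_available("onnxslim"):
        raise RuntimeError(
            "ConvertFP16ToFP32 requires onnxslim; "
            "install it with: python -m pip install onnxslim"
        )
    return {"ConvertFP16ToFP32": True}


# A Float32 model needs no extra option.
print(fp16_extra_options(model_is_fp16=False))
```

The returned dictionary can be passed as the extra_options argument of QConfig, as in the example above.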
Supported Op Type
Summary Table
Note: For built-in configs except those with block floating-point data types, the extra option ForceQuantizeNoInputCheck is set to True by default. In that case, ops listed below as “quantized only when input is quantized” are always quantized (their inputs are quantized and quantized outputs are produced). For custom configs with ForceQuantizeNoInputCheck=False, those ops follow the input-dependent behavior.
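The rule in this note can be written as a small truth function. The sketch below is a plain-Python restatement for the input-dependent ops only; the function name is hypothetical, the op set is copied from the comments in the table, and platform-specific restrictions such as NPU_CNN are not modeled.

```python
# Ops whose table comment reads "Quantized only when its input is quantized".
INPUT_DEPENDENT_OPS = frozenset({
    "AveragePool", "Clip", "LayerNormalization", "MaxPool", "Relu",
    "Reshape", "Squeeze", "Transpose", "Unsqueeze",
})


def input_dependent_op_is_quantized(op_type, input_is_quantized,
                                    force_quantize_no_input_check):
    """Restate the note: forced quantization overrides the input check."""
    if op_type not in INPUT_DEPENDENT_OPS:
        raise ValueError(f"{op_type!r} is not an input-dependent op in this sketch")
    return force_quantize_no_input_check or input_is_quantized


# With ForceQuantizeNoInputCheck=True the op is quantized regardless of its input.
print(input_dependent_op_is_quantized("Relu", False, True))
# With ForceQuantizeNoInputCheck=False the op follows its input.
print(input_dependent_op_is_quantized("Relu", False, False))
```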
Table: List of Quark ONNX Supported Quantized Ops
| Supported Ops | Comments |
|---|---|
| Add | |
| ArgMax | |
| AveragePool | Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True). |
| BatchNormalization | By default, the “optimize_model” parameter will fuse BatchNormalization to Conv/ConvTranspose/Gemm. For standalone BatchNormalization, quantization is supported only for NPU_CNN platforms by converting BatchNormalization to Conv. |
| Clip | Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True). |
| Concat | |
| Conv | |
| ConvTranspose | |
| DepthToSpace | Quantization is supported only for NPU_CNN platforms. |
| Div | Quantization is supported only for NPU_CNN platforms. |
| Erf | Quantization is supported only for NPU_CNN platforms. |
| Gather | |
| Gemm | |
| GlobalAveragePool | |
| HardSigmoid | Quantization is supported only for NPU_CNN platforms. |
| InstanceNormalization | |
| LayerNormalization | Supported for opset>=17. Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True). |
| LeakyRelu | |
| LpNormalization | Quantization is supported only for NPU_CNN platforms. |
| MatMul | |
| Min | Quantization is supported only for NPU_CNN platforms. |
| Max | Quantization is supported only for NPU_CNN platforms. |
| MaxPool | Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True). |
| Mul | |
| Pad | |
| PRelu | Quantization is supported only for NPU_CNN platforms. |
| ReduceMean | Quantization is supported only for NPU_CNN platforms. |
| Relu | Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True). |
| Reshape | Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True). |
| Resize | |
| Slice | Quantization is supported only for NPU_CNN platforms. |
| Sigmoid | |
| Softmax | |
| SpaceToDepth | Quantization is supported only for NPU_CNN platforms. |
| Split | |
| Squeeze | Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True). |
| Sub | Quantization is supported only for NPU_CNN platforms. |
| Tanh | Quantization is supported only for NPU_CNN platforms. |
| Transpose | Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True). |
| Unsqueeze | Quantized only when its input is quantized (or always when ForceQuantizeNoInputCheck=True). |
Where |