ExtendedDequantizeLinear#

ExtendedDequantizeLinear - 1#

Version#

  • name: ExtendedDequantizeLinear

  • domain: com.amd.quark

Summary#

This operator extends the official ONNX DequantizeLinear operator by adding early support for uint16 and int16 quantization, along with additional support for bfloat16, float16, and uint32. It enables floating-point quantization for deploying models on edge devices, and wide-bit quantization for detailed analysis of accuracy bottlenecks.

The ExtendedDequantizeLinear operator enhances the official DequantizeLinear by supporting additional data types while remaining backward compatible with it. If you want to try these new data types, consider using this operator, with two caveats: you must register our custom operator library with onnxruntime before running the quantized model, and because onnxruntime will no longer convert the quantized node to a QOperator, there is no additional acceleration at runtime.
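As a minimal sketch, the registration step with the onnxruntime Python API might look as follows; the library and model paths below are placeholders, not the actual artifact names:

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Placeholder path: substitute the actual Quark custom-operator library.
sess_options.register_custom_ops_library("libquark_custom_ops.so")

# The session can now resolve nodes in the com.amd.quark domain.
session = ort.InferenceSession("quantized_model.onnx", sess_options)
```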

The linear dequantization operator. It consumes a quantized tensor, a scale, and a zero point and computes the full-precision tensor. The dequantization formula is y = (x - x_zero_point) * x_scale. x_scale and x_zero_point must have the same shape, which determines the quantization granularity: a scalar for per-tensor/per-layer quantization, or a 1-D tensor for per-axis/per-channel quantization. See ExtendedQuantizeLinear for details on quantization granularity.

x_zero_point and x must have the same type. x and y must have the same shape. When dequantizing int32, there is no zero point (the zero point is assumed to be 0). A zero point is usually not used for float16 and bfloat16 quantization, but the dequantization formula remains the same for consistency. The output type is the same as that of x_scale, which also determines the precision of the multiplication operation.
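As an illustration of these rules (not the registered kernel itself), a minimal NumPy sketch of the operator’s semantics could look like this; the function name is hypothetical:

```python
import numpy as np

def extended_dequantize_linear(x, x_scale, x_zero_point=None, axis=1):
    """Sketch of y = (x - x_zero_point) * x_scale."""
    if x_zero_point is None:
        # Zero point defaults to 0, e.g. when dequantizing int32.
        x_zero_point = np.zeros_like(x_scale, dtype=x.dtype)
    if x_scale.ndim == 1:
        # Per-axis/per-channel: reshape the 1-D scale and zero point so
        # they broadcast along `axis`; a scalar scale needs no reshaping.
        bshape = [1] * x.ndim
        bshape[axis] = x_scale.shape[0]
        x_scale = x_scale.reshape(bshape)
        x_zero_point = x_zero_point.reshape(bshape)
    # Cast to x_scale's dtype first: the output type matches x_scale,
    # and that dtype sets the precision of the multiplication.
    dt = x_scale.dtype
    return (x.astype(dt) - x_zero_point.astype(dt)) * x_scale
```

For instance, a uint16 value of 65535 with a float32 scale of 0.05 and a zero point of 32768 dequantizes to (65535 - 32768) * 0.05 = 1638.35 as float32.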

Attributes#

axis - INT (default is ‘1’):

(Optional) The axis of the dequantizing dimension of the input tensor. Used for per-axis/per-channel quantization. Negative value means counting dimensions from the back. Accepted range is [-r, r-1] where r = rank(input).
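As a sketch, a per-channel node along axis 0 of a quantized weight tensor could be constructed with onnx.helper; the tensor names are placeholders for illustration:

```python
from onnx import helper

# Hypothetical names; dequantizes an int16 weight along its output-channel axis.
node = helper.make_node(
    "ExtendedDequantizeLinear",
    inputs=["w_quant", "w_scale", "w_zero_point"],
    outputs=["w_dequant"],
    axis=0,
    domain="com.amd.quark",
)
```

Note that the model’s opset imports must also declare the com.amd.quark domain for the node to validate.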

Inputs#

Between 2 and 3 inputs.

  • x (heterogeneous) - T1:

N-D quantized input tensor to be de-quantized.

  • x_scale (heterogeneous) - T2:

Scale for input x. For per-tensor/per-layer dequantization the scale is a scalar, for per-axis/per-channel dequantization it is a 1-D Tensor.

  • x_zero_point (optional, heterogeneous) - T1:

Zero point for input x. Shape must match x_scale. It’s optional. Zero point is 0 when it’s not specified.

Outputs#

  • y (heterogeneous) - T3:

N-D full precision output tensor. It has the same shape as input x.

Type Constraints#

  • T1 in ( tensor(int32), tensor(int16), tensor(int8), tensor(uint32), tensor(uint16), tensor(uint8), tensor(float16), tensor(bfloat16) ):

The type of the input ‘x’.

  • T2 in ( tensor(float) ):

The type of the input ‘x_scale’.

  • T3 in ( tensor(float) ):

The type of the output ‘y’.