Quantization Strategies

AMD Quark for ONNX offers three distinct quantization strategies tailored to meet the requirements of various hardware backends:

Post Training Weight-Only Quantization: Quantizes the weights ahead of time and leaves the input tensors unquantized, so they keep their original floating-point data type during inference.
Post Training Static Quantization: Quantizes both the weights and the input tensors of the model. For best results, it requires calibration with a dataset that is representative of the actual data, so that the quantization parameters for the input tensors can be determined accurately.
Post Training Dynamic Quantization: Quantizes the weights ahead of time and quantizes the input tensors dynamically at runtime. This is more flexible when the distribution of the input tensors is not known in advance or varies significantly during inference.
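
To make the difference between the three strategies concrete, the following is a minimal NumPy sketch of a single linear layer under each of them. It is not AMD Quark's API or implementation: it assumes symmetric, per-tensor int8 quantization, and every function name (`compute_scale`, `quantize`, `linear_weight_only`, `linear_static`, `linear_dynamic`) is hypothetical and exists only for illustration.

```python
import numpy as np

QMAX = 127  # assumption: symmetric signed 8-bit range [-127, 127]


def compute_scale(x):
    """Per-tensor scale that maps max(|x|) onto QMAX."""
    max_abs = float(np.max(np.abs(x)))
    return max_abs / QMAX if max_abs > 0 else 1.0


def quantize(x, scale):
    """Round to the nearest integer step and clip to the int8 range."""
    return np.clip(np.round(x / scale), -QMAX, QMAX).astype(np.int8)


def dequantize(q, scale):
    return q.astype(np.float32) * scale


def linear_weight_only(x, w_q, w_scale):
    # Weight-only: weights are stored as int8, the input stays in float.
    return x @ dequantize(w_q, w_scale)


def linear_static(x, w_q, w_scale, x_scale):
    # Static: the input scale was fixed ahead of time from calibration data.
    x_q = quantize(x, x_scale)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)  # integer matmul
    return acc.astype(np.float32) * (x_scale * w_scale)


def linear_dynamic(x, w_q, w_scale):
    # Dynamic: the input scale is recomputed from each input at runtime.
    return linear_static(x, w_q, w_scale, compute_scale(x))


rng = np.random.default_rng(0)
w = rng.normal(size=(16, 4)).astype(np.float32)       # float weights
calib = rng.normal(size=(64, 16)).astype(np.float32)  # stand-in calibration set
x = rng.normal(size=(2, 16)).astype(np.float32)       # inference input

# In all three strategies the weights are quantized ahead of time.
w_scale = compute_scale(w)
w_q = quantize(w, w_scale)

# Only the static strategy needs calibration data for the input scale.
x_scale_static = compute_scale(calib)

reference = x @ w
outputs = {
    "weight-only": linear_weight_only(x, w_q, w_scale),
    "static": linear_static(x, w_q, w_scale, x_scale_static),
    "dynamic": linear_dynamic(x, w_q, w_scale),
}
for name, y in outputs.items():
    print(f"{name}: max abs error = {float(np.max(np.abs(y - reference))):.4f}")
```

In this sketch the only difference between the static and dynamic paths is where the input scale comes from: a calibration set processed once before deployment, or the input tensor itself at inference time. A real toolchain typically uses more elaborate calibration methods than the simple max-abs rule assumed here.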

All three strategies share the same API; you select one through the quantization configuration, as demonstrated in the previous example. For details on setting the quantization configuration, refer to the “Configuring AMD Quark for ONNX” chapter.