INT8/INT16 Quantization for Ryzen AI

Introduction#

The Ryzen AI NPU supports two major categories of quantization schemes, which differ primarily in how scaling factors are represented and applied during quantization:

Power-of-Two Scales Quantization (XINT8): This method uses power-of-two scaling factors with symmetric INT8 quantization for activations, weights, and biases. It is highly optimized for performance on Ryzen AI hardware, enabling efficient computation through bit-shift operations. However, due to its constrained scaling scheme, it may introduce accuracy degradation for certain models.
Float Scales Quantization (A8W8 and A16W8): These methods use floating-point scaling factors, providing greater flexibility and typically better accuracy.
- A8W8 uses INT8 activations and weights.
- A16W8 increases activation precision to INT16, further improving accuracy at the cost of performance.
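To make the difference between the two scale types concrete, here is a minimal sketch in plain Python. It assumes a simple max-based symmetric calibration and rounds the power-of-two scale up so that no value clips; the function names and the sample weights are illustrative only and are not taken from any Ryzen AI tool.

```python
import math

def float_scale(values, num_bits=8):
    """Symmetric scale that maps the largest magnitude onto the top integer level."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0

def power_of_two_scale(values, num_bits=8):
    """Round the float scale up to the next power of two (assumed rounding policy)."""
    return 2.0 ** math.ceil(math.log2(float_scale(values, num_bits)))

def quantize(values, scale, num_bits=8):
    """Symmetric quantization: divide by the scale, round, clamp to the integer range."""
    qmin, qmax = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map integers back to real values."""
    return [q * scale for q in q_values]

def max_abs_error(values, scale, num_bits=8):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    recon = dequantize(quantize(values, scale, num_bits), scale)
    return max(abs(v - r) for v, r in zip(values, recon))

# Illustrative sample weights (made up for this sketch).
weights = [0.9, -0.35, 0.02, 0.5, -0.75]

s_float = float_scale(weights)        # 0.9 / 127
s_pow2 = power_of_two_scale(weights)  # 2 ** -7

err_float = max_abs_error(weights, s_float)
err_pow2 = max_abs_error(weights, s_pow2)
```

The power-of-two scale is at least as large as the float scale and less than twice as large, so its quantization steps are coarser. That coarser step is one way the constrained scaling scheme can cost accuracy.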

Choosing between these approaches involves a trade-off between performance and accuracy:

Use XINT8 when maximum performance is required and minor accuracy loss is acceptable.
Use A8W8 for a balance between performance and accuracy.
Use A16W8 when higher accuracy is critical and additional compute cost is acceptable.

This documentation provides detailed guides for both quantization approaches, including:
- How to quantize float models
- How to evaluate quantization accuracy
- Techniques to improve quantized model performance
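As a rough illustration of what evaluating quantization accuracy can look like, the sketch below compares the outputs of a float model and its quantized counterpart on the same input. The two metrics shown (signal-to-quantization-noise ratio and cosine similarity) are assumptions chosen for illustration; the detailed guides may rely on other metrics, such as task-level accuracy.

```python
import math

def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB; higher means closer to the float output."""
    signal = sum(r * r for r in reference)
    noise = sum((r - q) ** 2 for r, q in zip(reference, quantized))
    if noise == 0:
        return math.inf   # outputs are identical
    if signal == 0:
        return -math.inf  # only noise, no signal
    return 10.0 * math.log10(signal / noise)

def cosine_similarity(a, b):
    """Cosine of the angle between two output vectors; 1.0 means the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical outputs of the float and quantized models for one input.
float_out = [1.0, 0.0]
quant_out = [0.9, 0.0]

score_db = sqnr_db(float_out, quant_out)
similarity = cosine_similarity(float_out, quant_out)
```

In practice these values would be averaged over a representative evaluation set rather than computed on a single input.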

Summary of Differences#

Feature	XINT8	A8W8 / A16W8
Scale Type	Power-of-Two	Floating-point
Activation Precision	INT8	INT8 (A8W8), INT16 (A16W8)
Weight Precision	INT8	INT8
Bias Precision	INT32	INT32
Quantization Scheme	Symmetric	Symmetric
Performance	Highest	Medium (A8W8) / Lower (A16W8)
Accuracy	Lower (in some cases)	Higher
Hardware Efficiency	Optimal (bit-shift ops)	Less optimal
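The "bit-shift ops" entry in the table can be illustrated with a small sketch of rescaling an integer accumulator. This is a conceptual model only, not a description of how the NPU implements it: with a power-of-two scale the rescale reduces to an add and a shift, while an arbitrary float scale needs a multiply and a rounding step. The function names are hypothetical.

```python
import math

def requantize_shift(acc, shift):
    """Rescale an integer accumulator by 2**-shift using add-and-shift (round half up)."""
    rounding = 1 << (shift - 1) if shift > 0 else 0
    return (acc + rounding) >> shift

def requantize_float(acc, scale):
    """Rescale with an arbitrary float scale: one multiply plus round half up."""
    return math.floor(acc * scale + 0.5)

def clamp_int8(x):
    """Saturate to the signed 8-bit range."""
    return max(-128, min(127, x))

acc = 1000                                            # e.g. an INT32 accumulator value
via_shift = clamp_int8(requantize_shift(acc, 4))      # scale 2**-4, integer ops only
via_float = clamp_int8(requantize_float(acc, 2.0 ** -4))
```

Both paths give the same result when the float scale happens to be a power of two. The shift version reaches it using only integer operations.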

When to Use Each Method#

Choose XINT8 if:
- You need maximum throughput on Ryzen AI NPU
- Your model tolerates stricter quantization constraints
- You prioritize latency and efficiency over accuracy
Choose A8W8 if:
- You want a balance between performance and accuracy
- Your model shows noticeable degradation with XINT8
Choose A16W8 if:
- Accuracy is critical
- Your model is sensitive to quantization errors
- You can afford slightly lower performance
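The guidance above can be read as a simple selection procedure: try the schemes from fastest to slowest and keep the first one whose accuracy drop is acceptable. The sketch below encodes that reading. The helper name, the tolerance parameter, and the accuracy numbers are hypothetical placeholders, not measurements.

```python
# Schemes ordered from highest to lowest expected performance, per the summary table.
SCHEMES_FASTEST_FIRST = ["XINT8", "A8W8", "A16W8"]

def choose_scheme(float_accuracy, quantized_accuracy, max_drop):
    """Return the fastest scheme whose accuracy drop is within max_drop, else None."""
    for scheme in SCHEMES_FASTEST_FIRST:
        drop = float_accuracy - quantized_accuracy[scheme]
        if drop <= max_drop:
            return scheme
    return None  # no scheme meets the tolerance; accuracy-improvement techniques may be needed

# Placeholder accuracies for illustration only.
measured = {"XINT8": 0.731, "A8W8": 0.752, "A16W8": 0.758}
choice = choose_scheme(float_accuracy=0.760, quantized_accuracy=measured, max_drop=0.01)
```

With these placeholder numbers the XINT8 drop exceeds the one-point tolerance, so the helper falls through to A8W8.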

Next Steps#

To get started, refer to the detailed guides:

Each guide includes:
- Step-by-step quantization workflow
- Example code snippets
- Accuracy evaluation methods
- Advanced techniques such as AdaRound and AdaQuant for improving results