INT8/INT16 Quantization for Ryzen AI
Introduction
The Ryzen AI NPU supports two major categories of quantization schemes, which differ primarily in how scaling factors are represented and applied during quantization:
- Power-of-Two Scales Quantization (XINT8): This method uses power-of-two scaling factors with symmetric INT8 quantization for activations, weights, and biases. It is highly optimized for performance on Ryzen AI hardware, enabling efficient computation through bit-shift operations. However, because its scaling scheme is constrained, it may introduce accuracy degradation for certain models.
- Float Scales Quantization (A8W8 and A16W8): These methods use floating-point scaling factors, providing greater flexibility and typically better accuracy.
  - A8W8 uses INT8 activations and weights.
  - A16W8 increases activation precision to INT16, further improving accuracy at the cost of performance.
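The difference between the two scale types can be illustrated with a minimal sketch in plain Python. It is not the Ryzen AI quantizer: the helper names, the sample data, and the simple max-based scale selection are all assumptions made for illustration, and a real tool may calibrate scales differently.

```python
import math

def float_scale(values, bits):
    """Symmetric float scale: map the largest magnitude onto the integer range."""
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 32767 for INT16
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0

def power_of_two_scale(values, bits):
    """Round the float scale up to the next power of two (assumed policy, avoids clipping)."""
    return 2.0 ** math.ceil(math.log2(float_scale(values, bits)))

def quantize(values, scale, bits):
    """Symmetric quantization: divide by the scale, round, clamp to the signed range."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map integers back to real values."""
    return [q * scale for q in q_values]

def max_abs_error(values, scale, bits):
    """Largest round-trip error after quantize + dequantize."""
    restored = dequantize(quantize(values, scale, bits), scale)
    return max(abs(v - r) for v, r in zip(values, restored))

# Hypothetical activation values, chosen only for the demonstration.
activations = [-1.9, -0.73, 0.0, 0.41, 1.27, 2.6]

scale_xint8 = power_of_two_scale(activations, 8)  # power-of-two scale, INT8
scale_a8 = float_scale(activations, 8)            # float scale, INT8 (as in A8W8)
scale_a16 = float_scale(activations, 16)          # float scale, INT16 (as in A16W8 activations)

err_xint8 = max_abs_error(activations, scale_xint8, 8)
err_a8 = max_abs_error(activations, scale_a8, 8)
err_a16 = max_abs_error(activations, scale_a16, 16)

print(f"power-of-two INT8: scale={scale_xint8}, max error={err_xint8:.6f}")
print(f"float INT8:        scale={scale_a8:.6f}, max error={err_a8:.6f}")
print(f"float INT16:       scale={scale_a16:.8f}, max error={err_a16:.8f}")
```

On this sample data the power-of-two scale is coarser than the float scale, so its round-trip error is larger, while the INT16 representation reduces the error well below either INT8 variant. The ordering on real models depends on the data distribution and calibration method.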
Choosing between these approaches involves a trade-off between performance and accuracy:
- Use XINT8 when maximum performance is required and minor accuracy loss is acceptable.
- Use A8W8 for a balance between performance and accuracy.
- Use A16W8 when higher accuracy is critical and additional compute cost is acceptable.
This documentation provides detailed guides for both quantization approaches, including:
- How to quantize float models
- How to evaluate quantization accuracy
- Techniques to improve quantized model performance
Summary of Differences
| Feature | XINT8 | A8W8 / A16W8 |
|---|---|---|
| Scale Type | Power-of-Two | Floating-point |
| Activation Precision | INT8 | INT8 (A8W8), INT16 (A16W8) |
| Weight Precision | INT8 | INT8 |
| Bias Precision | INT32 | INT32 |
| Quantization Scheme | Symmetric | Symmetric |
| Performance | Highest | Medium (A8W8) / Lower (A16W8) |
| Accuracy | Lower (in some cases) | Higher |
| Hardware Efficiency | Optimal (bit-shift ops) | Less optimal |
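The "bit-shift ops" entry refers to the fact that multiplying by a power-of-two scale can be done with an integer shift instead of a general multiplication. The following sketch illustrates the general idea only; it is not a description of the NPU's actual datapath, and the function names and rounding mode (round half up) are assumptions.

```python
import math

def rescale_with_shift(acc, shift):
    """Multiply an integer accumulator by 2**-shift using integer ops only.

    Adds half of the divisor before shifting so the result rounds to nearest
    (ties round up). Python's >> on negative integers is an arithmetic shift.
    """
    if shift <= 0:
        return acc << -shift
    return (acc + (1 << (shift - 1))) >> shift

def rescale_with_float(acc, scale):
    """Multiply by an arbitrary float scale: needs a multiply and a rounding step."""
    return math.floor(acc * scale + 0.5)

# With a power-of-two scale, both paths agree, so the cheaper shift can be used.
acc = 1234                      # hypothetical INT32 accumulator value
shift = 5                       # scale = 2**-5 = 1/32
assert rescale_with_shift(acc, shift) == rescale_with_float(acc, 2.0 ** -shift)

# A float scale such as 0.03 has no equivalent shift and must use the multiply path.
print(rescale_with_shift(acc, shift), rescale_with_float(acc, 0.03))
```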
When to Use Each Method
Choose XINT8 if:
- You need maximum throughput on Ryzen AI NPU
- Your model tolerates stricter quantization constraints
- You prioritize latency and efficiency over accuracy

Choose A8W8 if:
- You want a balance between performance and accuracy
- Your model shows noticeable degradation with XINT8

Choose A16W8 if:
- Accuracy is critical
- Your model is sensitive to quantization errors
- You can afford slightly lower performance
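One way to apply this guidance in practice is to quantize the model with each scheme, measure the accuracy drop against the float model, and pick the fastest scheme that stays within an acceptable drop. The sketch below assumes the accuracy drops have already been measured; the function name, the dictionary layout, and the numbers are hypothetical.

```python
# Schemes ordered from fastest to slowest, following the guidance above.
SCHEMES_FASTEST_FIRST = ["XINT8", "A8W8", "A16W8"]

def choose_scheme(accuracy_drop, max_drop):
    """Return the fastest scheme whose measured accuracy drop is within max_drop.

    accuracy_drop maps a scheme name to its drop relative to the float model
    (for example 0.012 for a 1.2 point drop). Returns None if no scheme qualifies.
    """
    for name in SCHEMES_FASTEST_FIRST:
        if name in accuracy_drop and accuracy_drop[name] <= max_drop:
            return name
    return None

# Hypothetical measurements, not results from any real model.
measured = {"XINT8": 0.031, "A8W8": 0.012, "A16W8": 0.004}

choice = choose_scheme(measured, max_drop=0.015)
print(choice)  # A8W8: XINT8 exceeds the budget, A8W8 is the fastest that fits
```

If no scheme meets the budget, the function returns `None`, which would be the point to try accuracy-improvement techniques before relaxing the budget.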
Next Steps
To get started, refer to the detailed guide for each quantization approach.
Each guide includes:
- Step-by-step quantization workflow
- Example code snippets
- Accuracy evaluation methods
- Advanced techniques such as AdaRound and AdaQuant for improving results