Introduction

BFloat16 (Brain Floating Point 16) is a floating-point data format used in deep learning to reduce memory usage and computation while maintaining sufficient numerical precision. Unlike lower-precision formats such as INT8 or FP16, BF16 keeps the same dynamic range as FP32 (8 exponent bits) while reducing precision (7 mantissa bits), making it particularly useful for training and inference in neural networks.
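At the bit level, BF16 is essentially the upper half of an FP32 value: the sign bit and all 8 exponent bits are kept, and 16 of the 23 mantissa bits are dropped. The following sketch illustrates this with NumPy (illustrative truncation only; real conversions round to nearest even):

import numpy as np

# BF16 keeps FP32's sign and 8 exponent bits plus the top 7 mantissa bits,
# so zeroing the low 16 bits of an FP32 value approximates its BF16
# representation.
x = np.array([3.14159265], dtype=np.float32)
bits = x.view(np.uint32) & np.uint32(0xFFFF0000)
print(x[0], bits.view(np.float32)[0])  # 3.1415927 3.140625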

AMD's latest CPU, NPU, and GPU devices support BF16 natively, enabling faster matrix operations and reduced latency. In this tutorial, we explain how to quantize a model to BF16 using Quark.

BF16 quantization in Quark for ONNX

Here is a simple example of how to enable BF16 quantization.

from quark.onnx import ModelQuantizer, VitisQuantType, VitisQuantFormat
from onnxruntime.quantization.calibrate import CalibrationMethod
from quark.onnx.quantization.config.config import Config, QuantizationConfig

# Quantize both activations and weights to BF16, using MinMax calibration
# and the QDQ format (a Q/DQ node pair is inserted for each tensor).
quant_config = QuantizationConfig(calibrate_method=CalibrationMethod.MinMax,
                                  quant_format=VitisQuantFormat.QDQ,
                                  activation_type=VitisQuantType.QBFloat16,
                                  weight_type=VitisQuantType.QBFloat16,
                                  )

config = Config(global_quant_config=quant_config)

quantizer = ModelQuantizer(config)

# input_model_path, output_model_path, and data_reader (a calibration data
# reader yielding representative inputs) are supplied by the user.
quantizer.quantize_model(input_model_path, output_model_path, data_reader)

The BF16 quantization in the example above inserts a custom Q/DQ pair for each tensor, converting the model's weights and activations directly from FP32 to BF16, just as most frameworks do.

BF16 covers the same dynamic range as FP32, but with only 7 mantissa bits it sacrifices precision. This means small differences between numbers can vanish, which can amplify numerical instability and lead to overflow problems.
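For example, 257 needs 9 significand bits and is therefore not representable in BF16, so adding 1.0 to 256.0 has no effect. A minimal demonstration, assuming the third-party ml_dtypes package is installed:

import numpy as np
import ml_dtypes  # assumed available: pip install ml-dtypes

x = np.array([256.0, 1.0], dtype=ml_dtypes.bfloat16)
# 257 is not representable with BF16's 7 mantissa bits, so the sum
# rounds back to 256: the contribution of 1.0 disappears entirely.
print(x[0] + x[1])  # 256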

To address the overflow issue in BF16 quantization, calibration can be applied to re-scale weights and activations so that they better match BF16's dynamic range and exploit its dense numeric region near zero. To enable this, set 'WeightScaled' or 'ActivationScaled' in the extra options when you observe overflow issues.

# Enable re-scaling of weights and activations to mitigate BF16 overflow.
quant_config = QuantizationConfig(calibrate_method=CalibrationMethod.MinMax,
                                  quant_format=VitisQuantFormat.QDQ,
                                  activation_type=VitisQuantType.QBFloat16,
                                  weight_type=VitisQuantType.QBFloat16,
                                  extra_options={
                                      'WeightScaled': True,
                                      'ActivationScaled': True,
                                  }
                                 )

Note: When running inference with ONNX Runtime, the custom ops' .so (Linux) or .dll (Windows) file must be registered in the ORT session options.

import onnxruntime
from quark.onnx import get_library_path

# Pick the best available execution provider and the matching build of the
# custom-op library (ROCm, CUDA, or CPU).
if 'ROCMExecutionProvider' in onnxruntime.get_available_providers():
    device = 'ROCM'
    providers = ['ROCMExecutionProvider']
elif 'CUDAExecutionProvider' in onnxruntime.get_available_providers():
    device = 'CUDA'
    providers = ['CUDAExecutionProvider']
else:
    device = 'CPU'
    providers = ['CPUExecutionProvider']

# Register Quark's custom-op library so ORT can execute the custom Q/DQ nodes.
sess_options = onnxruntime.SessionOptions()
sess_options.register_custom_ops_library(get_library_path(device))
session = onnxruntime.InferenceSession(onnx_model_path, sess_options, providers=providers)
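
After the session is created, inference works as with any other ONNX model. A quick smoke test (a sketch; it assumes the model has a single input with a fully static shape):

import numpy as np

# Feed random data shaped like the model's first input and check that the
# quantized model produces outputs without runtime errors.
inp = session.get_inputs()[0]
dummy = np.random.rand(*inp.shape).astype(np.float32)
outputs = session.run(None, {inp.name: dummy})
print([o.shape for o in outputs])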

How to further improve the accuracy of BF16 quantization?

We can finetune the quantized model to further improve the accuracy of BF16 quantization. The Fast Finetuning function in Quark for ONNX includes two algorithms: AdaRound and AdaQuant. Since BF16 quantization involves no explicit rounding, only AdaQuant can be used.

quant_config = QuantizationConfig(calibrate_method=CalibrationMethod.MinMax,
                                  quant_format=VitisQuantFormat.QDQ,
                                  activation_type=VitisQuantType.QBFloat16,
                                  weight_type=VitisQuantType.QBFloat16,
                                  extra_options={
                                      'FastFinetune': {
                                          'NumIterations': 1000,
                                          'LearningRate': 1e-6,
                                          # Only AdaQuant applies to BF16;
                                          # there is no rounding to adapt.
                                          'OptimAlgorithm': 'adaquant',
                                          # Devices used for optimization and
                                          # for inference during finetuning.
                                          'OptimDevice': 'cpu',
                                          'InferDevice': 'cpu',
                                      }
                                  }
                                 )

License

Copyright (C) 2024, Advanced Micro Devices, Inc. All rights reserved.
SPDX-License-Identifier: MIT