Introduction#

In this tutorial, we will learn how to use BFP16 (Block Floating Point 16) quantization.

What is BFP16 Quantization?#

BFP16 (Block Floating Point 16) quantization is a technique that represents tensors using a block floating-point format, where multiple numbers share a common exponent. This format can provide a balance between dynamic range and precision while using fewer bits than standard floating-point representations. BFP16 quantization aims to reduce the computational complexity and memory footprint of neural networks, making them more efficient for inference on various hardware platforms, particularly those with limited resources.
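
To make the shared-exponent idea concrete, the following NumPy sketch quantizes a single block of values to a block floating-point representation. It is purely illustrative (the block size, mantissa width, and rounding scheme are assumptions), not the BFP16 kernel that Quark actually uses:

import numpy as np

def bfp_quantize_block(block, mantissa_bits=8):
    """Quantize one block of values to a shared-exponent (block floating-point) representation."""
    block = np.asarray(block, dtype=np.float32)
    max_abs = float(np.max(np.abs(block)))
    if max_abs == 0.0:
        return block
    # Shared exponent chosen so the largest-magnitude value fits the mantissa range.
    shared_exp = int(np.floor(np.log2(max_abs))) + 1
    # Every element becomes an integer mantissa scaled by the shared exponent.
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    qmin, qmax = -(2 ** (mantissa_bits - 1)), 2 ** (mantissa_bits - 1) - 1
    mantissas = np.clip(np.round(block / scale), qmin, qmax)
    return (mantissas * scale).astype(np.float32)

# Small values lose some precision when they share an exponent with large ones:
print(bfp_quantize_block([0.92, -0.13, 0.004, 0.51]))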

Key Concepts#

  1. Block Floating Point Format: In BFP16 quantization, data is grouped into blocks, and each block shares a common exponent. This reduces the storage requirements while preserving a sufficient dynamic range for most neural network operations. It differs from standard floating-point formats, which assign an individual exponent to each number.

  2. Dynamic Range and Precision: By using a shared exponent for each block, BFP16 can achieve a balance between range and precision. It allows for more flexible representation of values compared to fixed-point formats and can adapt to the magnitude of the data within each block.

  3. Reduced Computation Costs: BFP16 quantization reduces the number of bits required to represent each tensor element, leading to lower memory usage and faster computations. This is particularly useful for deploying models on devices with limited hardware resources.

  4. Compatibility with Mixed Precision: BFP16 can be combined with other quantization methods, such as mixed precision quantization, to optimize neural network performance further. This compatibility allows for flexible deployment strategies tailored to specific accuracy and performance requirements.

Benefits of BFP16 Quantization#

  1. Improved Efficiency: BFP16 quantization significantly reduces the number of bits needed to represent tensor values, leading to reduced memory bandwidth and faster computation times. This makes it ideal for resource-constrained environments.

  2. Maintained Accuracy: By balancing dynamic range and precision, BFP16 quantization minimizes the accuracy loss that can occur with more aggressive quantization methods.

  3. Hardware Compatibility: BFP16 is well-supported by modern hardware accelerators, making it a flexible and efficient choice for large-scale neural network training and deployment.

How to enable BFP16 quantization in Quark for ONNX?#

Here is a simple example of how to enable BFP16 quantization in Quark for ONNX.

from quark.onnx import ModelQuantizer, VitisQuantType, VitisQuantFormat
from onnxruntime.quantization.calibrate import CalibrationMethod
from quark.onnx.quantization.config.config import Config, QuantizationConfig

quant_config = QuantizationConfig(
    calibrate_method=CalibrationMethod.MinMax,
    quant_format=VitisQuantFormat.BFPFixNeuron,
    activation_type=VitisQuantType.QBFP,
    weight_type=VitisQuantType.QBFP,
)
config = Config(global_quant_config=quant_config)
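
After building the config, quantization itself is typically driven through the ModelQuantizer imported above. A rough sketch, where input_model_path, quantized_model_path, and calibration_data_reader are placeholders you would define yourself:

# The paths and the calibration data reader below are placeholders.
quantizer = ModelQuantizer(config)
quantizer.quantize_model(input_model_path, quantized_model_path, calibration_data_reader)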

Note: When running inference with ONNX Runtime, you need to register the custom ops library (the .so file on Linux or the .dll file on Windows) in the ORT session options.

import onnxruntime
from quark.onnx import get_library_path as vai_lib_path

# Alternatively, use the GPU configuration:
# device='cuda:0'
# providers = ['CUDAExecutionProvider']

device = 'cpu'
providers = ['CPUExecutionProvider']

sess_options = onnxruntime.SessionOptions()
sess_options.register_custom_ops_library(vai_lib_path(device))
# 'onnx_model_path' is the path to the BFP16-quantized ONNX model to run.
session = onnxruntime.InferenceSession(onnx_model_path, sess_options, providers=providers)
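
With the custom ops library registered, the quantized model runs like any other ONNX model. A minimal sketch, assuming input_data is a NumPy array that matches the model's input shape:

input_name = session.get_inputs()[0].name
outputs = session.run(None, {input_name: input_data})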

How to further improve the accuracy of a BFP16 quantized model in Quark for ONNX?#

If you want to further improve the accuracy of a model after applying BFP16 quantization, you can use fast_finetune to enhance the quantization accuracy. Please refer to this link for more details on how to enable fast_finetune in the configuration of Quark for ONNX. Here is a simple example:

from quark.onnx import ModelQuantizer, VitisQuantFormat, VitisQuantType
from onnxruntime.quantization.calibrate import CalibrationMethod
from quark.onnx.quantization.config.config import Config, QuantizationConfig

quant_config = QuantizationConfig(
    calibrate_method=CalibrationMethod.MinMax,
    quant_format=VitisQuantFormat.BFPFixNeuron,
    activation_type=VitisQuantType.QBFP,
    weight_type=VitisQuantType.QBFP,
    include_fast_ft=True,
    extra_options={
        'FastFinetune': {
            'DataSize': 100,
            'FixedSeed': 1705472343,
            'BatchSize': 5,
            'NumIterations': 100,
            'LearningRate': 0.000001,
            'OptimAlgorithm': 'adaquant',
            'OptimDevice': 'cpu',
            'InferDevice': 'cpu',
            'EarlyStop': True,
        }
    },
)
config = Config(global_quant_config=quant_config)

Note: You can install onnxruntime-gpu instead of onnxruntime to accelerate inference. The BFP quant type only supports fast_finetune with AdaQuant, not AdaRound. Set 'InferDevice' to 'cuda:0' to use the GPU for inference. Additionally, set 'OptimDevice' to 'cuda:0' to accelerate fast_finetune training with the GPU.
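
For instance, a GPU variant of the fast_finetune options might look like this (a sketch; it assumes onnxruntime-gpu is installed, and only the device fields differ from the CPU example above):

extra_options = {
    'FastFinetune': {
        'DataSize': 100,
        'OptimAlgorithm': 'adaquant',  # BFP only supports AdaQuant
        'OptimDevice': 'cuda:0',       # run fast_finetune training on the GPU
        'InferDevice': 'cuda:0',       # run inference on the GPU
        'EarlyStop': True,
    }
}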

Examples#

An example of quantizing a densenet121.ra_in1k model using the BFP16 quantization provided in Quark for ONNX is available at examples/onnx/accuracy_improvement/BFP/README.

License#

Copyright (C) 2024, Advanced Micro Devices, Inc. All rights reserved. SPDX-License-Identifier: MIT