Accelerate with Settings#
This page summarizes the settings that most directly affect ONNX quantization throughput. Use it together with Accelerate with GPUs when you want faster calibration or fast finetuning without changing the model itself.
Calibration throughput settings#
ExecutionProviders controls where calibration runs. Use GPU providers such as ROCMExecutionProvider or CUDAExecutionProvider when available to accelerate ONNX Runtime execution.
CalibWorkerNum controls how many workers collect calibration data. Higher values can reduce runtime, but they also increase memory usage.
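As a minimal sketch of the two calibration settings, the helper below orders GPU providers ahead of the CPU fallback and pairs the result with a worker count. The function names and the shape of the returned dictionary are illustrative assumptions, not the quantizer's actual configuration API; only the setting names ExecutionProviders and CalibWorkerNum and the provider names come from this page and from ONNX Runtime.

```python
# Hypothetical helper: names and dict layout are assumptions for illustration.
PREFERRED_GPU_PROVIDERS = ("ROCMExecutionProvider", "CUDAExecutionProvider")


def pick_execution_providers(available):
    """Return available GPU providers first, then the CPU provider as fallback."""
    chosen = [p for p in PREFERRED_GPU_PROVIDERS if p in available]
    chosen.append("CPUExecutionProvider")
    return chosen


def calibration_options(available, worker_num=1):
    """Bundle the two throughput settings described above into one dict."""
    if worker_num < 1:
        raise ValueError("CalibWorkerNum must be at least 1")
    return {
        "ExecutionProviders": pick_execution_providers(available),
        "CalibWorkerNum": worker_num,
    }
```

In practice the `available` list could come from `onnxruntime.get_available_providers()`, so that a GPU provider is requested only on machines where ONNX Runtime actually exposes one.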
Fast finetuning throughput settings#
mem_opt_level controls the memory-versus-speed trade-off in AdaRound and AdaQuant workflows. Lower values are faster; higher values reduce memory consumption.
num_workers increases host-side data loading parallelism during fast finetuning.
pin_memory can improve host-to-device transfer throughput when the data pipeline is feeding a GPU.
use_gds enables GPU Direct Storage for supported NVIDIA-based setups when optim_device is CUDA and mem_opt_level is 2.
GDS for fast finetuning#
GPU Direct Storage (GDS) can reduce fast finetuning latency when the workflow
becomes limited by disk-to-GPU data movement instead of GPU compute. This is
most relevant for AdaRound and AdaQuant runs that use
mem_opt_level = 2, where layer data is stored on disk to lower memory
usage.
GDS changes the data path rather than the algorithm itself, so it can improve throughput without changing quantization behavior or accuracy targets.
Use GDS only on supported NVIDIA GPU environments.
Set use_gds = True in the finetuning algorithm config.
GDS requires optim_device to use CUDA and works with mem_opt_level = 2.
For setup requirements, bottleneck analysis, and an example configuration, see Speed Up Fast Finetuning with GDS.
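The GDS preconditions above can be checked before a run starts. The sketch below is a hypothetical settings container, not the library's config class: the field names mirror the settings on this page, while the class name, the default values, and the `gds_problems` helper are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class FastFinetuneSettings:
    """Hypothetical container; defaults are placeholders, not library defaults."""
    optim_device: str = "cpu"
    mem_opt_level: int = 1
    num_workers: int = 0
    pin_memory: bool = False
    use_gds: bool = False


def gds_problems(settings):
    """Return reasons why use_gds cannot take effect (empty list if none)."""
    problems = []
    if not settings.use_gds:
        return problems  # nothing to check when GDS is off
    if not settings.optim_device.lower().startswith("cuda"):
        problems.append("use_gds requires optim_device to be CUDA")
    if settings.mem_opt_level != 2:
        problems.append("use_gds works with mem_opt_level = 2")
    return problems
```

A call such as `gds_problems(FastFinetuneSettings(optim_device="cuda:0", mem_opt_level=2, use_gds=True))` returns an empty list, while enabling use_gds on a CPU device or with a different mem_opt_level reports the violated condition. The check does not verify that the machine itself supports GDS.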
Practical guidance#
Start with a moderate CalibWorkerNum and increase it only while memory usage remains healthy.
If fast finetuning becomes input-pipeline bound, raise num_workers and enable pin_memory before increasing algorithm iterations.
Use mem_opt_level = 1 when you want a balanced default. Switch to mem_opt_level = 2 when memory pressure is the limiting factor.
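One way to turn this guidance into a repeatable tuning step is sketched below. The function names, the 0.8 memory threshold, and the worker cap are assumptions chosen for illustration; this page does not define what "healthy" memory usage means, so the threshold should be adjusted to the machine at hand.

```python
def next_calib_worker_num(current, memory_used_fraction,
                          healthy_fraction=0.8, max_workers=8):
    """Raise CalibWorkerNum by one only while memory usage stays healthy.

    healthy_fraction and max_workers are illustrative assumptions.
    """
    if memory_used_fraction >= healthy_fraction:
        return current  # memory is already tight: do not add workers
    return min(current + 1, max_workers)


def choose_mem_opt_level(memory_is_limiting):
    """Use 1 as the balanced default; switch to 2 under memory pressure."""
    return 2 if memory_is_limiting else 1
```

Here `memory_used_fraction` is expected to be measured by the caller after a calibration pass at the current worker count, so the worker count grows one step at a time instead of jumping to a value that exhausts memory.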