Speed Up Fast Finetuning with GDS#
GPU Direct Storage (GDS) can reduce Quark ONNX fast finetuning latency by
moving data directly between storage and GPU memory. This is most useful for
AdaRound and AdaQuant runs where data movement becomes the bottleneck.
Because of current software support constraints, this workflow is available only on supported NVIDIA GPU environments.
Background#
Fast finetuning algorithms such as AdaRound and AdaQuant can recover
quantization accuracy, but they are often among the most time-consuming parts
of the ONNX quantization pipeline.
When mem_opt_level = 2 is used, fast finetuning reduces memory usage by
storing intermediate layer data on disk and loading it on demand. This lowers
peak memory, but it can also make storage-to-GPU data transfer the limiting
factor for end-to-end throughput.
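The general spill-to-disk pattern behind this trade-off can be illustrated with a toy example. This is only a sketch of the idea, not Quark's implementation: `DiskCache` and the key `layer_0_input` are hypothetical names, and a plain Python list stands in for real tensor data.

```python
import os
import pickle
import tempfile


class DiskCache:
    """Toy spill-to-disk store: one file per key, read back only on request."""

    def __init__(self, directory):
        self.directory = directory

    def _path(self, key):
        return os.path.join(self.directory, f"{key}.pkl")

    def put(self, key, value):
        # Writing to disk frees the caller from keeping `value` in memory.
        with open(self._path(key), "wb") as f:
            pickle.dump(value, f)

    def get(self, key):
        # Every read pays a storage round trip; this is the cost that can
        # become the bottleneck when data is loaded on demand.
        with open(self._path(key), "rb") as f:
            return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    cache = DiskCache(tmp)
    cache.put("layer_0_input", [0.1, 0.2, 0.3])  # hypothetical layer data
    restored = cache.get("layer_0_input")
```

Only the data currently in use has to be resident in memory, at the price of one storage read per access.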
Identify the bottleneck#
Two simple ways to reason about the bottleneck are:
Roofline thinking: If the workflow is limited by memory bandwidth or data movement rather than arithmetic throughput, improving the input pipeline can matter more than adding compute.
GPU utilization: If GPU compute utilization stays low while the job is actively reading data, the workflow is likely input-pipeline-bound rather than compute-bound.
In these cases, GDS can be more effective than purely compute-side tuning.
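One rough way to check whether a run is input-pipeline-bound is to time the load and compute phases separately. The sketch below is a generic illustration, not part of Quark: `io_fraction`, `load_batch`, and `compute` are hypothetical names, and the two callables stand in for whatever reads the intermediate data and runs the finetuning step. Because GPU kernels launch asynchronously, a real `compute` callable would need to synchronize with the device before returning (for example with `torch.cuda.synchronize()`), or the timings will be misleading.

```python
import time


def io_fraction(load_batch, compute, num_steps):
    """Return the share of wall time spent inside load_batch (0.0 to 1.0)."""
    load_s = 0.0
    compute_s = 0.0
    for step in range(num_steps):
        t0 = time.perf_counter()
        batch = load_batch(step)   # stand-in for reading data from storage
        t1 = time.perf_counter()
        compute(batch)             # stand-in for the finetuning step
        t2 = time.perf_counter()
        load_s += t1 - t0
        compute_s += t2 - t1
    total = load_s + compute_s
    return load_s / total if total > 0 else 0.0


def slow_load(step):
    # Simulated slow storage read (hypothetical 10 ms per batch).
    time.sleep(0.01)
    return step


def fast_compute(batch):
    # Simulated cheap compute step.
    return batch


fraction = io_fraction(slow_load, fast_compute, num_steps=5)
print(f"time spent loading: {fraction:.0%}")
```

A fraction close to 1.0 suggests the job is waiting on data most of the time, which is the situation where a faster transfer path is likely to help.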
How GDS helps#
Without GDS, data typically moves through a CPU-mediated path:
storage -> host memory -> CPU-managed pipeline -> GPU memory
With GDS, the transfer path becomes more direct:
storage -> GPU memory
This reduces extra memory copies and CPU overhead, which can:
lower input latency,
improve GPU utilization,
and reduce the performance penalty of mem_opt_level = 2.
Requirements#
To enable GDS in Quark ONNX fast finetuning:
use a supported NVIDIA GPU environment,
set optim_device to CUDA,
set mem_opt_level = 2,
and enable use_gds = True in the finetuning algorithm config.
For software installation and environment requirements, refer to the NVIDIA documentation: NVIDIA DALI installation guide.
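The setting combination above can be checked before launching a run. The helper below is a hypothetical convenience function, not a Quark API. It validates only the three values named in this section and does not probe the hardware or the installed software.

```python
def gds_config_problems(optim_device, mem_opt_level, use_gds):
    """Return a list of reasons why the given settings cannot enable GDS.

    Hypothetical helper: it mirrors the requirements listed above and
    returns an empty list when they are met or when GDS is not requested.
    """
    problems = []
    if not use_gds:
        return problems  # nothing to validate when GDS is off
    if not str(optim_device).lower().startswith("cuda"):
        problems.append("optim_device must be a CUDA device, e.g. 'cuda:0'")
    if mem_opt_level != 2:
        problems.append("mem_opt_level must be 2")
    return problems


ok = gds_config_problems("cuda:0", 2, True)
bad = gds_config_problems("cpu", 1, True)
for reason in bad:
    print("GDS not usable:", reason)
```

Returning a list of reasons rather than raising on the first failure makes it easy to report every misconfigured setting at once.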
Example#
from quark.onnx import AdaRoundConfig, Int8Spec, QConfig, QLayerConfig

adaround_algo = AdaRoundConfig(
    learning_rate=0.1,
    num_iterations=100,
    mem_opt_level=2,
    optim_device="cuda:0",
    use_gds=True,
)

quant_config = QConfig(
    global_config=QLayerConfig(
        input_tensors=Int8Spec(),
        weight=Int8Spec(),
    ),
    algo_config=[adaround_algo],
)
Practical guidance#
Try GDS when fast finetuning is limited by data movement rather than GPU compute.
GDS is most relevant when mem_opt_level = 2 is already needed for memory reasons.
If the workflow is already compute-bound, GDS may provide little benefit.
GDS changes the data path only; it does not change the quantization algorithm or introduce an accuracy trade-off by itself.
On unsupported hardware, keep use_gds = False and tune settings such as num_workers and pin_memory instead.
For a summary of all throughput-related settings including num_workers, pin_memory, and use_gds, see Accelerate with Settings.