Memory and Disk Friendly Settings
This page summarizes the settings and recent pipeline improvements that help large ONNX quantization jobs fit within limited memory or temporary-disk budgets. Use these settings when the default workflow is correct but too expensive for your current machine.
Recent pipeline improvements
Recent Quark ONNX updates significantly reduced peak RSS for both MinMax
and, especially, LayerwisePercentile calibration. In internal benchmarking,
some activation-heavy workloads previously pushed LayerwisePercentile into
very large peak RSS territory. After the optimization work, those same
workloads fit within typical workstation memory budgets.
The main changes are:
Chunked processing for layerwise percentile calibration, replacing bulk loading with small-chunk iteration and buffer reuse.
Selective Calibration Propagation (SCP), where passthrough ops inherit calibration ranges instead of recomputing them.
Improved disk caching that preserves original tensor precision and avoids hidden memory inflation after reload.
Earlier release of intermediate resources so the high-water mark drops as soon as large temporary objects are no longer needed.
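The chunked-processing idea in the first item can be illustrated with a generic two-pass histogram. This is a minimal sketch in plain Python, not Quark's implementation: the helper names iter_chunks and chunked_percentile_range, the 2048-bin histogram, and the symmetric tail trimming are all assumptions chosen for illustration.

```python
from typing import Callable, Iterable, Iterator, List, Tuple


def iter_chunks(values: List[float], chunk_size: int) -> Iterator[List[float]]:
    """Yield fixed-size slices so only one chunk is processed at a time."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


def chunked_percentile_range(
    make_chunks: Callable[[], Iterable[List[float]]],
    percentile: float = 99.9,
    num_bins: int = 2048,
) -> Tuple[float, float]:
    """Estimate a clipped (low, high) range without materializing all values.

    make_chunks is called twice and must yield the same chunks each time.
    """
    # Pass 1: streaming min/max, one chunk resident at a time.
    lo, hi = float("inf"), float("-inf")
    for chunk in make_chunks():
        lo = min(lo, min(chunk))
        hi = max(hi, max(chunk))
    if lo >= hi:
        return lo, hi  # degenerate input (for example constant data)

    # Pass 2: fold every chunk into one histogram buffer that is reused.
    hist = [0] * num_bins
    scale = num_bins / (hi - lo)
    total = 0
    for chunk in make_chunks():
        for value in chunk:
            hist[min(int((value - lo) * scale), num_bins - 1)] += 1
        total += len(chunk)

    # Trim (100 - percentile) / 2 percent of the samples from each tail.
    tail = total * (100.0 - percentile) / 200.0
    bin_width = (hi - lo) / num_bins
    lower, upper = lo, hi
    cumulative = 0
    lower_found = False
    for index, count in enumerate(hist):
        cumulative += count
        if not lower_found and cumulative > tail:
            lower = lo + index * bin_width
            lower_found = True
        if cumulative >= total - tail:
            upper = lo + (index + 1) * bin_width
            break
    return lower, upper


# Toy data: values in [0, 99] plus one large outlier.
activations = [float(i % 100) for i in range(10_000)] + [1.0e6]
low, high = chunked_percentile_range(lambda: iter_chunks(activations, 256), 99.0)
```

Peak memory in this sketch is bounded by one chunk plus the fixed-size histogram, regardless of how many activation values are processed, and the result does not depend on the chunk size.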
In practice, this means calibration method choice is much more accuracy-driven than memory-driven for most models.
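Selective Calibration Propagation, listed among the changes above, can be pictured as a single pass over the graph in which passthrough ops copy the range of their data input. The following is a hedged sketch on a toy graph, not Quark's code: the Node type, the propagate_ranges helper, and the assumption that the first input is the data input are all illustrative.

```python
from typing import Dict, List, NamedTuple, Tuple

# Op types this page lists as distribution-preserving.
PASSTHROUGH_OPS = {
    "Reshape", "Transpose", "MaxPool", "Split",
    "Slice", "Squeeze", "Unsqueeze", "Gather",
}

Range = Tuple[float, float]


class Node(NamedTuple):
    """Hypothetical minimal node record; real graphs carry more fields."""
    op_type: str
    inputs: List[str]
    outputs: List[str]


def propagate_ranges(nodes: List[Node], measured: Dict[str, Range]) -> Dict[str, Range]:
    """Copy ranges through passthrough ops instead of measuring them again.

    nodes must be in topological order; measured maps tensor names to the
    (min, max) ranges that calibration actually collected.
    """
    ranges = dict(measured)
    for node in nodes:
        if node.op_type not in PASSTHROUGH_OPS:
            continue
        source = node.inputs[0]  # assumption: first input is the data tensor
        if source not in ranges:
            continue
        for output in node.outputs:
            # Measured values win; only fill in outputs that have no range yet.
            ranges.setdefault(output, ranges[source])
    return ranges


graph = [
    Node("Conv", ["x", "w"], ["conv_out"]),
    Node("Reshape", ["conv_out", "shape"], ["reshape_out"]),
    Node("Transpose", ["reshape_out"], ["transpose_out"]),
    Node("Relu", ["transpose_out"], ["relu_out"]),
]
measured = {"x": (-1.0, 1.0), "conv_out": (-1.5, 2.0), "relu_out": (0.0, 2.0)}
ranges = propagate_ranges(graph, measured)
```

In this toy graph only three tensors are measured, yet five end up with ranges, because the Reshape and Transpose outputs inherit the range of conv_out.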
Calibration-focused settings
CalibOptimizeMem reduces calibration memory pressure and is a strong default for large models or large calibration datasets.
CalibOptimizeDisk is particularly useful with CalibMethod.LayerwisePercentile because it avoids caching intermediate activation tensors in memory or on disk, recomputing them only when needed.
CalibPassthroughOpTypes enables Selective Calibration Propagation for distribution-preserving ops such as Reshape, Transpose, MaxPool, Split, Slice, Squeeze, Unsqueeze, and Gather.
CalibWorkerNum can reduce calibration runtime, but higher values also raise memory pressure because each worker needs its own resources.
TmpDir moves temporary calibration files to a directory with more available space when the default temp filesystem is too small.
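The settings above can be gathered in one place. The sketch below only builds a plain Python dictionary. The option names come from this page, but the value types (booleans, a list of op-type strings, an integer, a path string), the default of one worker, and the helper name build_calibration_options are assumptions, as is the way the dictionary is handed to Quark. Check the Quark ONNX configuration reference before relying on it.

```python
from typing import Any, Dict, Optional

# Passthrough op types named on this page.
RECOMMENDED_PASSTHROUGH_OPS = [
    "Reshape", "Transpose", "MaxPool", "Split",
    "Slice", "Squeeze", "Unsqueeze", "Gather",
]


def build_calibration_options(
    tmp_dir: Optional[str] = None,
    worker_num: int = 1,
) -> Dict[str, Any]:
    """Collect memory- and disk-friendly calibration options (hypothetical helper)."""
    if worker_num < 1:
        raise ValueError("worker_num must be at least 1")
    options: Dict[str, Any] = {
        "CalibOptimizeMem": True,    # assumed boolean switch
        "CalibOptimizeDisk": True,   # assumed boolean switch
        "CalibPassthroughOpTypes": list(RECOMMENDED_PASSTHROUGH_OPS),
        "CalibWorkerNum": worker_num,  # more workers: faster, but more memory
    }
    if tmp_dir is not None:
        # Only set TmpDir when the default temp filesystem is too small.
        options["TmpDir"] = tmp_dir
    return options


calibration_options = build_calibration_options(worker_num=2)
```

Keeping the worker count as an explicit argument makes the runtime-versus-memory trade-off visible at the call site.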
Fast finetuning memory settings
mem_opt_level in AdaRound and AdaQuant controls how aggressively intermediate data is cached. 0 is fastest and uses the most memory, 1 is the default balance, and 2 minimizes memory at the cost of more disk activity and longer runtimes.
If mem_opt_level = 2 is too slow, combine it with data-loader tuning such as num_workers and pin_memory to recover some throughput.
On supported NVIDIA environments, use_gds = True can improve the disk-backed mem_opt_level = 2 finetuning path by reducing storage-to-GPU transfer overhead. See Speed Up Fast Finetuning with GDS.
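The finetuning knobs above can be grouped and validated before a run. This is a minimal sketch under stated assumptions: the field names are copied from this page, but the FinetuneMemorySettings class, the num_workers and pin_memory defaults, and the idea of keeping all four values in one object are hypothetical. Where each value belongs in a real Quark configuration is not covered here.

```python
from dataclasses import asdict, dataclass

# Trade-offs for each level, as described on this page.
MEM_OPT_LEVELS = {
    0: "fastest, uses the most memory",
    1: "default balance",
    2: "lowest memory, more disk activity and longer runtimes",
}


@dataclass
class FinetuneMemorySettings:
    """Hypothetical container for the fast finetuning memory knobs."""
    mem_opt_level: int = 1     # default balance per this page
    num_workers: int = 0       # assumed default for data loading
    pin_memory: bool = False   # assumed default for data loading
    use_gds: bool = False      # only on supported NVIDIA environments

    def __post_init__(self) -> None:
        if self.mem_opt_level not in MEM_OPT_LEVELS:
            raise ValueError(f"mem_opt_level must be one of {sorted(MEM_OPT_LEVELS)}")
        if self.num_workers < 0:
            raise ValueError("num_workers must not be negative")


# Lowest-memory path, with data-loader tuning to recover some throughput.
low_memory = FinetuneMemorySettings(mem_opt_level=2, num_workers=4, pin_memory=True)
low_memory_dict = asdict(low_memory)
```

Validating the level early avoids discovering a typo only after a long calibration run has already started.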
Practical guidance
If a model previously failed with LayerwisePercentile because of memory, retry it on the current pipeline before falling back to a less accurate calibration method.
Use CalibOptimizeMem first when calibration hits memory limits.
Keep CalibOptimizeDisk enabled for LayerwisePercentile unless disk activity is the primary bottleneck.
Set CalibPassthroughOpTypes to the recommended passthrough-op list to avoid redundant calibration work.
Keep CalibWorkerNum moderate on memory-constrained machines.
Use TmpDir when the default /tmp location is too small for cached files.
Use mem_opt_level = 2 only when the lower memory footprint matters more than raw speed.
Takeaway
With the recent ONNX calibration optimizations, peak RSS for MinMax and
LayerwisePercentile is now much closer on most models. In practice, choose
the calibration method for accuracy first, then use the settings above if you
still need to reduce memory or disk pressure further.