LayerwisePercentile

CalibMethod.LayerwisePercentile is an accuracy-oriented calibration method that searches percentile candidates per layer instead of applying one fixed percentile everywhere. It is useful when a model is sensitive to activation range selection and default calibration methods do not produce stable results.
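The per-layer search idea can be illustrated with a small, self-contained sketch. This is not Quark's implementation: the function names, the mean-squared-error metric, the 4-bit symmetric quantizer, and the toy data are all assumptions chosen to keep the example short. It only shows why a layer with an outlier can prefer a lower percentile than one fixed global setting would give it.

```python
def percentile(sorted_vals, pct):
    """Linearly interpolated percentile of a sorted, non-empty list."""
    pos = pct / 100.0 * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def quantize_error(values, clip, num_levels=7):
    """Mean squared error after symmetric clipping and uniform quantization.

    num_levels=7 models a signed 4-bit grid (an assumption for illustration).
    """
    if clip <= 0:
        return float("inf")
    scale = clip / num_levels
    total = 0.0
    for v in values:
        clipped = max(-clip, min(clip, v))
        dequantized = round(clipped / scale) * scale
        total += (v - dequantized) ** 2
    return total / len(values)


def search_layer_percentiles(activations_by_layer, candidates):
    """For each layer, keep the candidate percentile with the lowest error."""
    best = {}
    for layer, values in activations_by_layer.items():
        magnitudes = sorted(abs(v) for v in values)
        scored = [
            (quantize_error(values, percentile(magnitudes, p)), p)
            for p in candidates
        ]
        best[layer] = min(scored)[1]
    return best


# Hypothetical activations: one evenly spread layer, one with a single outlier.
layers = {
    "smooth": [i / 500.0 for i in range(-500, 501)],
    "outlier": [i / 1000.0 for i in range(1, 1000)] + [3.0],
}
choice = search_layer_percentiles(layers, [99.0, 99.9, 100.0])
print(choice)
```

In this toy setup the layer with the outlier ends up with a percentile below 100, because clipping the single large value costs less error than stretching the quantization grid to cover it.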

Recent memory improvements

Recent updates to the Quark ONNX pipeline reduce the peak resident set size (RSS) of LayerwisePercentile. In internal benchmarking, some activation-heavy workloads saw roughly 20x to 66x lower peak RSS. A method that could previously require very large memory budgets now fits standard workstation-class hardware.

The main changes are:

  • Chunked processing during layerwise percentile calibration, replacing bulk loading of all calibration data with small-chunk iteration and buffer reuse.

  • Selective Calibration Propagation (SCP), where distribution-preserving ops such as Reshape and Transpose inherit the calibration range of their input instead of recomputing it.

  • Improved disk caching that preserves original tensor precision and avoids hidden memory growth caused by dtype promotion on reload.

  • Earlier release of intermediate resources, including inference sessions and other large temporary objects.
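The first item in the list, chunked processing with buffer reuse, can be sketched as follows. This is an illustration under stated assumptions rather than the Quark code path: `iter_chunks`, `streaming_percentile`, the fixed-bin histogram, and the requirement that `max_abs` is known up front are all hypothetical. The point is that statistics are accumulated chunk by chunk through one reusable single-precision buffer, so the full calibration set never sits in memory at once.

```python
from array import array


def iter_chunks(read_sample, num_samples, chunk_size):
    """Yield calibration values chunk by chunk through one reused buffer.

    The buffer uses typecode "f" (single precision, typically 4 bytes per
    item), so values are not promoted to a wider type while staged.
    The consumer must finish with a chunk before requesting the next one.
    """
    buffer = array("f", [0.0]) * chunk_size
    for start in range(0, num_samples, chunk_size):
        count = min(chunk_size, num_samples - start)
        for i in range(count):
            buffer[i] = read_sample(start + i)
        yield memoryview(buffer)[:count]


def streaming_percentile(chunks, pct, max_abs, num_bins=2048):
    """Estimate a percentile of |x| from a histogram built over chunks.

    max_abs is assumed to be a known upper bound on |x|.
    Returns the upper edge of the bin where the cumulative count reaches pct.
    """
    hist = [0] * num_bins
    total = 0
    for chunk in chunks:
        for v in chunk:
            idx = min(num_bins - 1, int(abs(v) / max_abs * num_bins))
            hist[idx] += 1
            total += 1
    if total == 0:
        raise ValueError("no calibration samples")
    target = pct / 100.0 * total
    seen = 0
    for idx, count in enumerate(hist):
        seen += count
        if seen >= target:
            return (idx + 1) * max_abs / num_bins
    return max_abs


# Hypothetical data source: 1000 samples evenly spread over [0, 1).
def read_sample(i):
    return i / 1000.0


estimate = streaming_percentile(iter_chunks(read_sample, 1000, 64), 50.0, 1.0)
print(estimate)
```

Only the histogram and one chunk-sized buffer are resident at any time, and the estimate does not depend on the chunk size.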

When to use it

  • Try it when MinMax or a single Percentile setting produces noticeable accuracy degradation.

  • It is especially useful for models where different layers need different clipping behavior.

  • With the recent memory reductions, it is now practical to evaluate LayerwisePercentile on standard hardware instead of ruling it out only because of historical memory cost.

  • Because it evaluates several percentile candidates per layer, it can require more calibration work than simpler methods. Expect longer calibration time in exchange for better calibration quality.

Key settings

  • Set calibration_method to CalibMethod.LayerwisePercentile in the tensor config you want to calibrate.

  • Use LWPMetric to choose the metric used to compare percentile candidates.

  • Use PercentileCandidates to control which percentile values are explored during the layerwise search.

  • Use CalibWorkerNum to parallelize calibration when you have enough CPU and memory headroom.

  • Use CalibPassthroughOpTypes to enable Selective Calibration Propagation on distribution-preserving ops. The recommended list is ["Reshape", "Transpose", "MaxPool", "Split", "Slice", "Squeeze", "Unsqueeze", "Gather"].
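To make the effect of CalibPassthroughOpTypes concrete, here is a minimal sketch of range propagation over a toy graph. The graph representation (a topologically ordered list of single-input, single-output nodes), the `propagate_ranges` function, and the `calibrate` callback are assumptions made for illustration, not Quark internals. Only the op list is taken from the recommendation above.

```python
# Recommended passthrough list, copied from the setting described above.
PASSTHROUGH_OPS = frozenset([
    "Reshape", "Transpose", "MaxPool", "Split",
    "Slice", "Squeeze", "Unsqueeze", "Gather",
])


def propagate_ranges(nodes, calibrate, passthrough_ops=PASSTHROUGH_OPS):
    """Assign a (min, max) range to every node output.

    nodes: topologically ordered (op_type, input_name, output_name) tuples
           (a simplified, hypothetical graph format).
    calibrate: callback that runs the expensive per-tensor calibration.
    Passthrough ops reuse their input's range; everything else is calibrated.
    Returns the ranges and the list of tensors that were actually calibrated.
    """
    ranges = {}
    calibrated = []
    for op_type, input_name, output_name in nodes:
        if op_type in passthrough_ops and input_name in ranges:
            ranges[output_name] = ranges[input_name]
        else:
            ranges[output_name] = calibrate(output_name)
            calibrated.append(output_name)
    return ranges, calibrated


# Hypothetical four-node graph and a stand-in calibration function.
toy_nodes = [
    ("Conv", "x", "conv_out"),
    ("Reshape", "conv_out", "reshape_out"),
    ("Transpose", "reshape_out", "transpose_out"),
    ("Relu", "transpose_out", "relu_out"),
]
fake_ranges = {"conv_out": (-2.0, 2.0), "relu_out": (0.0, 2.0)}
ranges, calibrated = propagate_ranges(toy_nodes, lambda name: fake_ranges[name])
print(calibrated)
```

In this toy graph only two of the four outputs go through calibration; the Reshape and Transpose outputs reuse the range of the Conv output.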

Memory and disk behavior

  • CalibOptimizeMem reduces calibration memory pressure and is a good default when activation caching becomes expensive.

  • CalibOptimizeDisk is specific to LayerwisePercentile and avoids retaining intermediate activation tensors in memory or on disk, recomputing them instead when needed.

  • CalibPassthroughOpTypes reduces redundant calibration work on distribution-preserving ops and can further lower peak RSS.

  • TmpDir lets you move temporary calibration files away from a small default system temp directory when disk space is limited.
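The options named in this section and in Key settings can be collected in one place, as in the sketch below. The option names and the passthrough op list come from this page. Everything else is an assumption: the value types (booleans, an integer, a list of floats), the example percentile values, the `build_lwp_options` helper, and the idea of passing the options as a plain dictionary. LWPMetric is left out because its accepted values are not listed here. Check the Quark reference before relying on any of these values.

```python
import os
import tempfile


def build_lwp_options(tmp_root=None, workers=1):
    """Collect LayerwisePercentile-related options in a plain dictionary.

    tmp_root: directory on a disk with enough free space; None falls back
              to the system default temp location.
    workers:  assumed to be an integer worker count.
    """
    # Create a dedicated scratch directory so calibration files are easy
    # to find and remove afterwards.
    tmp_dir = tempfile.mkdtemp(prefix="lwp_calib_", dir=tmp_root)
    return {
        # Illustrative candidates only; tune for your model.
        "PercentileCandidates": [99.9, 99.99, 99.999, 100.0],
        "CalibWorkerNum": workers,
        "CalibPassthroughOpTypes": [
            "Reshape", "Transpose", "MaxPool", "Split",
            "Slice", "Squeeze", "Unsqueeze", "Gather",
        ],
        # Assumed boolean switches.
        "CalibOptimizeMem": True,
        "CalibOptimizeDisk": False,
        "TmpDir": tmp_dir,
    }


options = build_lwp_options()
print(sorted(options))
```

Keeping the worker count at 1 to start with is a conservative choice, since each additional worker needs its own CPU and memory headroom.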

Takeaway

For most models, you can now choose LayerwisePercentile based on accuracy needs rather than assuming it will exceed available memory. Start with the default memory-saving settings, then tune percentile candidates and worker count as needed.