LayerwisePercentile#
CalibMethod.LayerwisePercentile is an accuracy-oriented calibration method
that searches percentile candidates per layer instead of applying one fixed
percentile everywhere. It is useful when a model is sensitive to activation
range selection and default calibration methods do not produce stable results.
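The idea of a per-layer percentile search can be sketched in a few lines of NumPy. This is an illustration only, not Quark's implementation: the helper names, the symmetric 8-bit fake quantization, and the use of mean squared error as the comparison metric are all assumptions made for the example.

```python
import numpy as np

def quantize_dequantize(x, clip, num_bits=8):
    """Symmetric fake quantization of x onto the range [-clip, clip]."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = clip / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax)
    return q * scale

def pick_percentile(activations, candidates):
    """Return the candidate percentile with the lowest reconstruction MSE."""
    abs_act = np.abs(activations)
    best_p, best_err = None, np.inf
    for p in candidates:
        clip = np.percentile(abs_act, p)
        if clip <= 0:
            continue  # a zero clipping range cannot be quantized
        err = np.mean((activations - quantize_dequantize(activations, clip)) ** 2)
        if err < best_err:
            best_p, best_err = p, err
    return best_p

# Hypothetical activations for two layers with different tail behavior.
rng = np.random.default_rng(0)
layers = {
    "gaussian_like": rng.normal(size=10_000),
    "heavy_tailed": rng.laplace(size=10_000),
}
candidates = [99.0, 99.9, 99.99, 100.0]

# Each layer is searched independently, so layers may end up with different percentiles.
chosen = {name: pick_percentile(act, candidates) for name, act in layers.items()}
print(chosen)
```

A single fixed percentile corresponds to skipping the loop and using the same value for every layer.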
Recent memory improvements#
Recent Quark ONNX pipeline updates significantly reduced the peak RSS memory
footprint of LayerwisePercentile. In internal benchmarking, some
activation-heavy workloads saw roughly 20x to 66x lower peak RSS, turning
LayerwisePercentile from a method that could previously require extremely
large memory budgets into one that fits standard workstation-class hardware.
The main changes are:
Chunked processing during layerwise percentile calibration, replacing bulk loading of all calibration data with small-chunk iteration and buffer reuse.
Selective Calibration Propagation (SCP), where distribution-preserving ops such as Reshape and Transpose inherit calibration ranges instead of recomputing them.
Improved disk caching that preserves original tensor precision and avoids hidden memory growth caused by dtype promotion on reload.
Earlier release of intermediate resources, including inference sessions and other large temporary objects.
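To make the chunked-processing idea concrete, the sketch below estimates a percentile from a fixed-size histogram that is filled chunk by chunk, reusing one scratch buffer. It is a minimal illustration under stated assumptions, not the Quark pipeline: the function names, chunk size, bin count, and the two-pass structure are all hypothetical choices.

```python
import numpy as np

def iter_chunks(data, chunk_size):
    """Yield consecutive slices of data; slicing a NumPy array returns a view."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]

def chunked_percentile(data, percentile, chunk_size=4096, bins=2048):
    """Approximate a percentile of |data| using a fixed-size histogram."""
    # Pass 1: find the value range without building |data| in full.
    max_abs = max(float(np.max(np.abs(c))) for c in iter_chunks(data, chunk_size))
    if max_abs == 0.0:
        return 0.0

    hist = np.zeros(bins, dtype=np.int64)
    scratch = np.empty(chunk_size, dtype=data.dtype)  # reused for every chunk

    # Pass 2: accumulate bin counts one chunk at a time.
    for c in iter_chunks(data, chunk_size):
        buf = scratch[:len(c)]
        np.abs(c, out=buf)
        counts, _ = np.histogram(buf, bins=bins, range=(0.0, max_abs))
        hist += counts

    # Locate the first bin whose cumulative share reaches the target.
    cdf = np.cumsum(hist) / hist.sum()
    idx = min(int(np.searchsorted(cdf, percentile / 100.0)), bins - 1)
    return (idx + 1) * max_abs / bins  # right edge of that bin

rng = np.random.default_rng(0)
samples = rng.normal(size=100_000).astype(np.float32)
approx = chunked_percentile(samples, 99.0)
exact = float(np.percentile(np.abs(samples), 99.0))
print(approx, exact)
```

Apart from the input itself, working memory here is bounded by the chunk size and the bin count rather than by the total amount of calibration data, which is the property the chunked approach is after. The result is approximate, with a resolution of one histogram bin.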
When to use it#
Try it when MinMax or a single Percentile setting produces noticeable accuracy degradation.
It is especially useful for models where different layers need different clipping behavior.
With the recent memory reductions, it is now practical to evaluate LayerwisePercentile on standard hardware instead of ruling it out only because of historical memory cost.
It can require more calibration work than simpler methods, so expect a runtime trade-off in exchange for better calibration quality.
Key settings#
Set calibration_method to CalibMethod.LayerwisePercentile in the tensor config you want to calibrate.
Use LWPMetric to choose the metric used to compare percentile candidates.
Use PercentileCandidates to control which percentile values are explored during the layerwise search.
Use CalibWorkerNum to parallelize calibration when you have enough CPU and memory headroom.
Use CalibPassthroughOpTypes to enable Selective Calibration Propagation on distribution-preserving ops. The recommended list is ["Reshape", "Transpose", "MaxPool", "Split", "Slice", "Squeeze", "Unsqueeze", "Gather"].
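As a rough sketch, the option names above could be collected in a plain Python dictionary before being handed to the quantization config. Only the key names and the passthrough op list come from this page. The helper name, the value types, and the example percentile values are assumptions, and the page does not show how the dictionary is passed to Quark, so check the Quark ONNX configuration reference for the exact mechanism. LWPMetric is left out because its accepted values are not listed here.

```python
# Recommended passthrough op types, as listed on this page.
RECOMMENDED_PASSTHROUGH_OPS = [
    "Reshape", "Transpose", "MaxPool", "Split",
    "Slice", "Squeeze", "Unsqueeze", "Gather",
]

def build_lwp_options(worker_num=1, percentile_candidates=None):
    """Hypothetical helper that gathers LayerwisePercentile-related options."""
    if percentile_candidates is None:
        # Illustrative values only; the documented defaults may differ.
        percentile_candidates = [99.9, 99.99, 99.999]
    return {
        "PercentileCandidates": list(percentile_candidates),
        "CalibWorkerNum": worker_num,  # assumed to be an integer worker count
        "CalibPassthroughOpTypes": list(RECOMMENDED_PASSTHROUGH_OPS),
    }

options = build_lwp_options(worker_num=4)
print(options)
```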
Memory and disk behavior#
CalibOptimizeMem reduces calibration memory pressure and is a good default when activation caching becomes expensive.
CalibOptimizeDisk is specific to LayerwisePercentile and avoids retaining intermediate activation tensors in memory or on disk, recomputing them instead when needed.
CalibPassthroughOpTypes reduces redundant calibration work on distribution-preserving ops and can further lower peak RSS.
TmpDir lets you move temporary calibration files away from a small default system temp directory when disk space is limited.
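The saving from passthrough op types can be pictured with a toy graph walk: outputs of distribution-preserving ops reuse the range of their input, so no activations need to be collected for them. This is a simplified sketch, not Quark's implementation. The graph representation (single-input nodes in topological order), the function name, and the range values are assumptions made for illustration.

```python
PASSTHROUGH_OPS = {
    "Reshape", "Transpose", "MaxPool", "Split",
    "Slice", "Squeeze", "Unsqueeze", "Gather",
}

def propagate_ranges(nodes, calibrated):
    """Fill in ranges for passthrough outputs by inheriting from their input.

    nodes: list of (op_type, input_name, output_name) in topological order.
    calibrated: dict mapping tensor name to a (min, max) range measured from data.
    """
    ranges = dict(calibrated)
    for op_type, inp, out in nodes:
        if op_type in PASSTHROUGH_OPS and inp in ranges and out not in ranges:
            ranges[out] = ranges[inp]  # inherited, nothing recomputed
    return ranges

# Hypothetical four-node graph; only non-passthrough outputs were calibrated.
nodes = [
    ("Conv", "x", "conv_out"),
    ("Reshape", "conv_out", "flat"),
    ("Transpose", "flat", "flat_t"),
    ("MatMul", "flat_t", "y"),
]
calibrated = {"x": (-1.0, 1.0), "conv_out": (-3.5, 4.2), "y": (-8.0, 7.5)}

ranges = propagate_ranges(nodes, calibrated)
print(ranges["flat"], ranges["flat_t"])
```

In this toy graph, two of the five tensors never need their activations cached or scanned, which is where the reduction in calibration work and peak memory would come from.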
Takeaway#
For most models, you can now choose LayerwisePercentile based on accuracy
needs rather than assuming it will exceed available memory. Start with the
default memory-saving settings, then tune percentile candidates and worker
count as needed.