File-to-File LLM Quantization#

Challenges in Quantizing Ultra-Large Models#

For quantization schemes that operate independently on each tensor, such as weight-only quantization (FP8, MXFP4, INT4, etc.) or dynamic activation quantization combined with weight quantization, standard tools still load the entire model into GPU memory. For a 600B+ parameter model this means hundreds of gigabytes of weights (roughly 1.2 TB at 16 bits per parameter), which leads to out-of-memory (OOM) failures even on multi-GPU nodes.
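The size estimate above is plain arithmetic and can be checked with a short sketch. The helper name `weights_size_gb` is hypothetical, and the figure covers raw weights only (no activations, buffers, or framework overhead).

```python
def weights_size_gb(num_params: float, bits_per_param: float) -> float:
    """Return the raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# A 600B-parameter model stored in BF16/FP16 (16 bits per parameter).
bf16_gb = weights_size_gb(600e9, 16)   # 1200.0 GB, i.e. about 1.2 TB

# The same model after 4-bit weight-only quantization (scales not counted).
int4_gb = weights_size_gb(600e9, 4)    # 300.0 GB

print(f"BF16: {bf16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
```

The source checkpoint alone exceeds the combined memory of many multi-GPU nodes, which is why loading it in full is the bottleneck.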

This is wasteful: these quantization schemes are per-tensor operations that do not require the full model graph. Each tensor can be quantized independently.
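To make the independence concrete, here is a minimal pure-Python sketch of symmetric per-tensor integer quantization. It is an illustration of the general idea, not Quark's implementation; the function names and the rounding choice are assumptions.

```python
from typing import List, Tuple


def quantize_symmetric(values: List[float], num_bits: int = 4) -> Tuple[List[int], float]:
    """Quantize one tensor (flattened to a list) using only its own values."""
    qmax = 2 ** (num_bits - 1) - 1      # 7 for INT4
    qmin = -(2 ** (num_bits - 1))       # -8 for INT4
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the representable range.
    codes = [min(qmax, max(qmin, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


codes, scale = quantize_symmetric([7.0, -3.0, 1.2, 0.0])
print(codes, scale)   # [7, -3, 1, 0] 1.0
```

The scale depends only on the tensor being quantized, so no other tensor, and no model graph, needs to be in memory at the same time.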

The Solution: File-to-File Quantization, No Full Model Loading#

File-to-File Quantization exploits this independence by reading each .safetensors file, quantizing the weights it contains, and writing the result directly to a new file — without ever loading the full model into memory.
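The read, quantize, write loop described above can be sketched as follows. This is a minimal illustration rather than Quark's code: `quantize_file_to_file` and its callback parameters are hypothetical names. In practice `load_shard` and `save_shard` could be `safetensors.torch.load_file` and `safetensors.torch.save_file`.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List

Shard = Dict[str, Any]  # tensor name -> tensor


def quantize_file_to_file(
    src_dir,
    dst_dir,
    load_shard: Callable[[Path], Shard],
    quantize_tensor: Callable[[str, Any], Any],
    save_shard: Callable[[Shard, Path], Any],
) -> List[str]:
    """Process one shard at a time so only a single file is resident in memory."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for shard_path in sorted(src.glob("*.safetensors")):
        shard = load_shard(shard_path)                      # read one file
        quantized = {name: quantize_tensor(name, tensor)    # per-tensor work
                     for name, tensor in shard.items()}
        save_shard(quantized, dst / shard_path.name)        # write the result
        del shard, quantized                                # release before the next file
        written.append(shard_path.name)
    return written
```

Peak memory is then bounded by the largest shard plus its quantized copy, independent of how many shards the checkpoint has.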

Standard vs. File-to-File quantization#
Item	Standard quantization	File-to-file quantization
What gets loaded	The entire model (all files / full state dict)	One file at a time (plus optional recovery metadata)
Core operation	Per-tensor weight quantization	Per-tensor weight quantization (same math)
Peak memory driver	Full model size (often hundreds of GB for 600B+)	Largest single file (typically ~5–10 GB)
Pre-quantized input handling	Typically assumes BF16/FP16 inputs	Can recover precision, then re-quantize (e.g., FP8 or compressed-tensors)
Output	Quantized weights + exported artifacts	Quantized weights + exported artifacts

Both approaches produce the same per-tensor quantization results. File-to-file mode supports weight-only quantization (e.g., FP8, MXFP4, INT4) and dynamic activation quantization + weight quantization, where each tensor can be processed independently — no calibration data, forward pass, or full model graph is required. The only difference is how tensors are loaded: the standard approach loads the entire model at once, while file-to-file reads and writes one .safetensors file at a time.
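Because peak memory follows the largest single file, it can be useful to find that file before starting. The sketch below reads only the header of each `.safetensors` file (an 8-byte little-endian length followed by a JSON table of tensors), so no tensor data is loaded. The helper names are hypothetical, and the directory is assumed to contain at least one `.safetensors` file.

```python
import json
import struct
from pathlib import Path
from typing import Dict, Tuple


def read_safetensors_header(path) -> Dict[str, dict]:
    """Return the tensor table of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))   # 8-byte little-endian size
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)                      # optional free-form metadata
    return header


def shard_data_bytes(path) -> int:
    """Sum the byte ranges of all tensors stored in one shard."""
    total = 0
    for entry in read_safetensors_header(path).values():
        start, end = entry["data_offsets"]
        total += end - start
    return total


def largest_shard(model_dir) -> Tuple[str, int]:
    """Return (file name, tensor bytes) of the shard that drives peak memory."""
    sizes = {p.name: shard_data_bytes(p)
             for p in sorted(Path(model_dir).glob("*.safetensors"))}
    name = max(sizes, key=sizes.get)
    return name, sizes[name]
```

Calling `largest_shard("/path/to/model")` gives a rough lower bound on the memory needed per step; dequantized or intermediate copies add to it.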

Supported Input Formats#

Format	Quant Method	Description	Dependencies
BF16 / FP16	—	Standard HuggingFace model, loaded directly	—
FP8	`fp8`	FP8 weights with `_scale_inv` tensors; dequantized before re-quantization	Triton
compressed-tensors	`compressed-tensors`	HuggingFace-style packed weights; decompressed before re-quantization	compressed-tensors
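To check which row of the table a checkpoint falls into, one option is to inspect its `config.json`. The sketch below assumes the checkpoint follows the Hugging Face convention of a `quantization_config` entry with a `quant_method` field; how Quark itself detects the input format is not described here, and `detect_quant_method` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import Optional


def detect_quant_method(model_dir) -> Optional[str]:
    """Return the declared quant method (e.g. "fp8"), or None for plain BF16/FP16."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    # Assumption: pre-quantized checkpoints declare their format here.
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")
```

A result of `"fp8"` or `"compressed-tensors"` indicates that the corresponding optional dependency from the table is needed for precision recovery.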

Usage Examples#

Example 1: File-to-File Quantization via Python API#

This approach mirrors the _build_quant_config helper in quantize_quark.py — it uses LLMTemplate to auto-detect the model architecture and build the quantization config from a scheme name, avoiding manual QConfig construction.

import json
from quark.torch import LLMTemplate, ModelQuantizer

model_path = "/path/to/model"
save_path = "/path/to/output"

# Read model type from config.json
with open(f"{model_path}/config.json") as f:
    model_type = json.load(f)["model_type"]

# Build quant config
template = LLMTemplate.get(model_type)
quant_config = template.get_config(
    scheme="mxfp4",                              # or "fp8", "int4_wo_128", etc.
    exclude_layers=["*self_attn*", "*lm_head"],
)

# Run file-to-file quantization
quantizer = ModelQuantizer(quant_config)
quantizer.direct_quantize_checkpoint(
    pretrained_model_path=model_path,
    save_path=save_path,
)

Example 2: File-to-File Mode via `quantize_quark.py`#

The quantize_quark.py script supports a --file2file_quantization flag that bypasses model loading and the standard quantization pipeline. It reads config.json to determine the model type, builds quant_config using the same LLMTemplate / --quant_scheme mechanism, and directly quantizes safetensors files in a file-to-file manner.

python examples/torch/language_modeling/llm_ptq/quantize_quark.py \
    --model_dir /path/to/model \
    --quant_scheme mxfp4 \
    --exclude_layers "*self_attn*" "*mlp.gate" "*mlp.gate.linear" "*lm_head" \
    --output_dir /path/to/output \
    --file2file_quantization \
    --skip_evaluation
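The exclusion patterns above look like shell-style globs. The sketch below shows how such patterns would select layer names if they are matched with `fnmatch`-style rules against the full layer name; that matching rule is an assumption for illustration, not a statement of how Quark evaluates `--exclude_layers`, and the layer names are made up.

```python
from fnmatch import fnmatchcase
from typing import Iterable


def is_excluded(layer_name: str, patterns: Iterable[str]) -> bool:
    """Return True if any glob pattern matches the whole layer name."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


patterns = ["*self_attn*", "*mlp.gate", "*mlp.gate.linear", "*lm_head"]

for name in [
    "model.layers.0.self_attn.q_proj",   # excluded by "*self_attn*"
    "model.layers.0.mlp.gate",           # excluded by "*mlp.gate"
    "model.layers.0.mlp.gate_proj",      # kept: "*mlp.gate" must match to the end
    "lm_head",                           # excluded by "*lm_head"
]:
    print(name, is_excluded(name, patterns))
```

Under this rule `*mlp.gate` targets a router named `gate` without also catching `gate_proj`, because the pattern has no trailing wildcard.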

API Reference: `direct_quantize_checkpoint`#

ModelQuantizer.direct_quantize_checkpoint(
    pretrained_model_path,
    save_path,
    keep_excluded_layers_as_original_model_state=False,
    weight_converters=None,
    device=None,
    presharded_weights=None,
)

Parameter	Default	Description
`pretrained_model_path`	(required)	Path to the pretrained model directory containing `.safetensors` files.
`save_path`	(required)	Directory path to save the quantized `.safetensors` files and all auxiliary files (`config.json`, `model.safetensors.index.json`, tokenizer files, etc.).
`keep_excluded_layers_as_original_model_state`	`False`	If `True`, excluded layers that are already quantized in the source checkpoint keep their original model-state format in the output. Useful when the source is a pre-quantized checkpoint and certain layers should pass through unchanged.
`weight_converters`	`None`	Optional list of `WeightConverter` instances to transform tensors after precision recovery and before quantization. See Weight Transformation with WeightConverter below.
`device`	`None`	Device for tensor operations (e.g. `"cuda"`, `"cuda:0"`, `"cpu"`). Defaults to `"cuda"` when `None`.
`presharded_weights`	`None`	Optional pre-loaded shard dictionary. When provided, the method uses these tensors directly instead of reading from `pretrained_model_path`.
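After a run, the `model.safetensors.index.json` written to `save_path` can be used to check which tensors landed in which output file. The sketch below relies on the `weight_map` field of the standard sharded-checkpoint index; `tensors_by_shard` is a hypothetical helper, and single-file checkpoints may not have an index at all.

```python
import json
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def tensors_by_shard(model_dir) -> Dict[str, List[str]]:
    """Group tensor names by the .safetensors file that stores them."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    # Sort for stable, readable output.
    return {shard: sorted(names) for shard, names in sorted(shards.items())}
```

Comparing the result for the source and output directories is a quick way to spot tensors that were added (for example scale tensors) or renamed.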

Weight Transformation with WeightConverter#

WeightConverter allows pre-quantization weight transformations on each .safetensors shard as it passes through the file-to-file pipeline. Converters run after precision recovery (dequantization of pre-quantized inputs) and before Quark quantization.

Use cases:

Splitting a fused projection (e.g. gate_up_proj) into separate tensors (gate_proj, up_proj) that Quark quantizes individually.
Renaming tensor keys to match a different model variant’s naming convention.
Applying custom packing or layout transformations before quantization.

Each converter receives the tensor dictionary for one .safetensors shard and returns a modified dictionary. Converters in the list are applied in order.

Example — split fused gate/up projection:

from quark.torch import LLMTemplate, ModelQuantizer
from quark.torch.quantization.weight_convert import Chunk, WeightConverter

template = LLMTemplate.get("qwen")
quant_config = template.get_config(scheme="mxfp4")

weight_converters = [
    WeightConverter(
        "gate_up_proj.weight",
        ["gate_proj.weight", "up_proj.weight"],
        operations=[Chunk(dim=0)],
    ),
]

quantizer = ModelQuantizer(quant_config)
quantizer.direct_quantize_checkpoint(
    pretrained_model_path="/path/to/model",
    save_path="/path/to/output",
    weight_converters=weight_converters,
)
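To show the effect of such a split on a shard's tensor dictionary, here is a library-free sketch using nested lists in place of tensors. It assumes the fused weight stores the gate rows first and the up rows second, and that the split is into two equal parts along dimension 0; the actual layout depends on the model, and `split_fused_gate_up` is a hypothetical stand-in, not the WeightConverter implementation.

```python
from typing import Dict, List

Matrix = List[List[float]]
FUSED_SUFFIX = "gate_up_proj.weight"


def split_fused_gate_up(shard: Dict[str, Matrix]) -> Dict[str, Matrix]:
    """Replace every fused gate/up weight with two tensors of half the rows."""
    out: Dict[str, Matrix] = {}
    for key, rows in shard.items():
        if not key.endswith(FUSED_SUFFIX):
            out[key] = rows                      # unrelated tensors pass through
            continue
        if len(rows) % 2:
            raise ValueError(f"{key}: cannot split {len(rows)} rows into two equal parts")
        half = len(rows) // 2
        prefix = key[: -len(FUSED_SUFFIX)]
        out[prefix + "gate_proj.weight"] = rows[:half]   # assumed: gate rows first
        out[prefix + "up_proj.weight"] = rows[half:]     # assumed: up rows second
    return out
```

After the split, each of the two resulting tensors is quantized on its own, with its own scales.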

Note

WeightConverter is suited to simple operations applied after precision recovery: renaming a single key suffix, or splitting one source tensor into several. It is not suited to multi-source merges, cross-shard scale pairing, or dtype conversions such as FP4 → FP8. For those cases, write a normalization script that pre-processes the checkpoint before running file-to-file quantization.

Progressive (Two-Step) Quantization#

Progressive quantization applies quantization in two sequential stages within a single file-to-file pass:

The weight is first quantized to an intermediate format (e.g. FP8 E4M3 per-tensor).
The intermediate result is then re-quantized to the final target format (e.g. INT4 per-channel).

This can produce better accuracy than direct single-step quantization for aggressive targets such as INT4, because the first stage preserves more information than the final format alone.

Built-in progressive scheme: `int4_fp8`

The int4_fp8 scheme (W4A8) uses progressive quantization internally:

Weight stage 1: FP8 E4M3 per-tensor static quantization.
Weight stage 2: INT4 per-channel static quantization of the FP8 result.
Activation: FP8 E4M3 per-tensor dynamic quantization.

This matches the AMD-Quark W4A8 recipe used by models such as amd/Kimi-K2-Thinking-W4A8.

import json
from quark.torch import LLMTemplate, ModelQuantizer

model_path = "/path/to/model"
save_path = "/path/to/output"

with open(f"{model_path}/config.json") as f:
    model_type = json.load(f)["model_type"]

template = LLMTemplate.get(model_type)
quant_config = template.get_config(scheme="int4_fp8")   # progressive W4A8

quantizer = ModelQuantizer(quant_config)
quantizer.direct_quantize_checkpoint(
    pretrained_model_path=model_path,
    save_path=save_path,
)

The same scheme can be selected from the command line:

python examples/torch/language_modeling/llm_ptq/quantize_quark.py \
    --model_dir /path/to/model \
    --quant_scheme int4_fp8 \
    --file2file_quantization \
    --output_dir /path/to/output \
    --skip_evaluation

Custom progressive scheme via `ProgressiveSpec`:

from quark.torch.quantization.config.config import (
    QConfig, QLayerConfig, ProgressiveSpec,
    FP8E4M3PerTensorSpec, Int4PerChannelSpec,
)
from quark.torch import ModelQuantizer

weight_spec = ProgressiveSpec(
    first_stage=FP8E4M3PerTensorSpec(
        observer_method="min_max", scale_type="float32", is_dynamic=False
    ),
    second_stage=Int4PerChannelSpec(
        symmetric=True, scale_type="float32",
        round_method="half_even", ch_axis=0, is_dynamic=False
    ),
).to_quantization_spec()

quant_config = QConfig(global_quant_config=QLayerConfig(weight=weight_spec))
quantizer = ModelQuantizer(quant_config)
quantizer.direct_quantize_checkpoint(
    pretrained_model_path="/path/to/model",
    save_path="/path/to/output",
)

Output tensor naming for progressive quantization:

Progressive quantization writes three tensors per weight:

Tensor key	Contents
`<layer>.weight`	Packed weight from stage 2 (e.g. INT4 values).
`<layer>.weight_scale`	Scale from stage 1 (e.g. FP8 per-tensor scale).
`<layer>.weight_scale_2`	Scale from stage 2 (e.g. INT4 per-channel scale).
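As a rough guide to how the three tensors relate, the sketch below reconstructs approximate weights from them. It rests on assumptions that are not confirmed here: the INT4 codes are already unpacked into integers, both scales are multiplicative, and the stage 2 scale holds one value per output channel (row), as with `ch_axis=0`. `dequantize_progressive` is a hypothetical helper, not part of Quark.

```python
from typing import List


def dequantize_progressive(
    int4_codes: List[List[int]],    # unpacked contents of <layer>.weight
    weight_scale: float,            # stage 1 per-tensor scale
    weight_scale_2: List[float],    # stage 2 per-channel scales, one per row
) -> List[List[float]]:
    """Undo stage 2 (per-channel), then stage 1 (per-tensor)."""
    if len(int4_codes) != len(weight_scale_2):
        raise ValueError("expected one stage 2 scale per output channel")
    return [
        [code * channel_scale * weight_scale for code in row]
        for row, channel_scale in zip(int4_codes, weight_scale_2)
    ]


approx = dequantize_progressive([[7, -8], [1, 0]], 0.5, [2.0, 4.0])
print(approx)   # [[7.0, -8.0], [2.0, 0.0]]
```

Under these assumptions a consumer can apply the per-channel scale when expanding INT4 to the intermediate format, and the per-tensor scale afterwards.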

Requirements#

PyTorch
safetensors
Triton (optional, only needed for FP8 input format dequantization)
compressed-tensors (optional, only needed for compressed-tensors input format)