Quark ONNX Quantization Tutorial For GPTQ#

GPTQ is an advanced post-training quantization method, originally developed for generative pre-trained transformers, that minimizes accuracy degradation by applying a layer-wise error-compensation mechanism during quantization. Instead of requiring access to the full training dataset, GPTQ uses a small calibration set to estimate the impact of quantization and optimize weight adjustments. This makes it particularly effective for large language models and other architectures that are sensitive to quantization noise.

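At its core, GPTQ works layer by layer: for a layer with weight matrix W and calibration inputs X, it searches for quantized weights that minimize the layer’s output reconstruction error, roughly

\hat{W} = \arg\min_{\hat{W}} \lVert W X - \hat{W} X \rVert_2^2

where \hat{W} is constrained to the quantization grid, and the error introduced by quantizing each weight is compensated by adjusting the weights that have not yet been quantized.
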
By applying the GPTQ method within the AMD Quark toolchain, developers can efficiently deploy highly optimized quantized models with improved accuracy retention, reduced compute demand, and enhanced performance across AMD’s hardware ecosystem.

The example has the following parts:

  • Install requirements

  • Prepare model

  • Prepare dataset

  • Quantize models

    • INT8 only

    • INT8 and GPTQ

    • INT4 with MatMulNBits and GPTQ

  • Evaluate models

1) Install The Necessary Python Packages:#

In addition to Quark, which must be installed as documented here, extra packages are required for this tutorial.

%pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
%pip install amd-quark
%pip install -r ./requirements.txt

2) Export the ONNX Model From OPT-125M#

Let’s download the necessary files and put them in a folder named opt-125m.

!mkdir opt-125m
!wget -P opt-125m https://huggingface.co/facebook/opt-125m/resolve/main/pytorch_model.bin
!wget -P opt-125m https://huggingface.co/facebook/opt-125m/resolve/main/config.json
!wget -P opt-125m https://huggingface.co/facebook/opt-125m/resolve/main/tokenizer_config.json
!wget -P opt-125m https://huggingface.co/facebook/opt-125m/resolve/main/vocab.json
!wget -P opt-125m https://huggingface.co/facebook/opt-125m/resolve/main/merges.txt
!wget -P opt-125m https://huggingface.co/facebook/opt-125m/resolve/main/generation_config.json
!wget -P opt-125m https://huggingface.co/facebook/opt-125m/resolve/main/special_tokens_map.json

Now create a folder named “models” and export opt-125m to the ONNX format into that folder.

!mkdir -p models
!optimum-cli export onnx --model ./opt-125m --task text-generation ./models/

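Optionally, sanity-check the exported model before quantizing it. The snippet below is a minimal sketch; it assumes the onnx and onnxruntime packages are available in your environment (both are needed by Quark’s ONNX quantizer anyway).

import onnx
import onnxruntime as ort

# Verify that the exported graph is structurally valid ONNX.
onnx.checker.check_model(onnx.load("models/model.onnx"))

# Create a CPU inference session and list the model's expected inputs.
session = ort.InferenceSession("models/model.onnx", providers=["CPUExecutionProvider"])
print([inp.name for inp in session.get_inputs()])
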
Import all the dependencies.

import copy
import logging
import os
import random
from typing import Any, Union

import numpy as np
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader, Dataset, SequentialSampler
from tqdm import tqdm
from transformers import AutoTokenizer, GPT2Tokenizer, PreTrainedTokenizer

from quark.onnx import Config, ModelQuantizer
from quark.onnx.quantization.config import get_default_config

3) Prepare Dataset#

We provide a dataloader that supports three commonly used datasets: Pileval, CNN DailyMail, and WikiText. In this tutorial, calibration uses the Pileval dataset and the quantized models are evaluated on WikiText-2, but you are encouraged to experiment with the other datasets to evaluate GPTQ’s effectiveness across different data domains.

The WikiText-2 dataset is a widely used benchmark for evaluating language models. It consists of high-quality Wikipedia text curated to better reflect natural language usage compared to earlier corpora. A key feature is the preservation of article structure, such as headings and paragraph organization—information often lost in simpler datasets. This structure helps language models learn long-range dependencies more effectively.

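If you want to see what the evaluation text looks like, the short sketch below downloads WikiText-2 with the same arguments used by the helpers further down and prints one non-empty sample.

from datasets import load_dataset

preview = load_dataset("wikitext", "wikitext-2-raw-v1", split="train", cache_dir="data_cache")
print(f"{len(preview)} training rows")
# Show the first 200 characters of the first non-empty line.
print(next(text for text in preview["text"] if text.strip())[:200])
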
def get_calib_dataloader(
    dataset_name: str, **kwargs: Any
) -> Union[DataLoader[torch.Tensor], DataLoader[list[dict[str, torch.Tensor]]], DataLoader[dict[str, torch.Tensor]]]:
    if dataset_name in ["pileval", "cnn_dailymail"]:
        return get_calib_dataloader_to_tensor(dataset_name, **kwargs)
    elif dataset_name in ["pileval_for_awq_benchmark", "wikitext_for_gptq_benchmark"]:
        return get_calib_dataloader_to_list(dataset_name, **kwargs)
    else:
        raise NotImplementedError


def get_pileval(
    tokenizer: PreTrainedTokenizer, nsamples: int, seqlen: int, device: str | None, seed: int = 0
) -> list[dict[str, torch.Tensor]]:
    dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation", cache_dir="data_cache")
    dataset = dataset.shuffle(seed=seed)
    samples = []
    n_run = 0
    for data in dataset:
        line = data["text"]
        line = line.strip()
        line_encoded = tokenizer.encode(line)
        if len(line_encoded) > 512:
            continue
        sample = torch.tensor([line_encoded])
        if sample.numel() == 0:
            continue
        sample = sample.to(device)
        samples.append(sample)
        n_run += 1
        if n_run == nsamples:
            break
    # now concatenate all samples and split according to block size
    cat_samples = torch.cat(samples, dim=1)
    n_split = cat_samples.shape[1] // seqlen
    logging.debug(f" * Split into {n_split} blocks")
    traindataset = []
    for i in range(n_split):
        traindataset.append({"input_ids": cat_samples[:, i * seqlen : (i + 1) * seqlen]})
    return traindataset


def get_wikitext2(
    tokenizer: PreTrainedTokenizer, nsamples: int, seqlen: int, device: str | None, seed: int = 0
) -> list[dict[str, torch.Tensor]]:
    traindata = load_dataset("wikitext", "wikitext-2-raw-v1", split="train", cache_dir="data_cache")
    trainenc = tokenizer("\n\n".join(traindata["text"]), return_tensors="pt")
    trainenc = trainenc.to(device)

    import random

    random.seed(seed)
    torch.random.manual_seed(seed)

    traindataset = []
    for _ in range(nsamples):
        i = random.randint(0, trainenc.input_ids.shape[1] - seqlen - 1)
        j = i + seqlen
        inp = trainenc.input_ids[:, i:j]
        attention_mask = torch.ones_like(inp)
        traindataset.append({"input_ids": inp, "attention_mask": attention_mask})
    return traindataset


def get_calib_dataloader_to_list(
    dataset_name: str = "pileval_for_awq_benchmark",
    tokenizer: AutoTokenizer = None,
    batch_size: int = 1,
    num_calib_data: int = 128,
    seqlen: int = 2048,
    device: str = "cpu",
) -> DataLoader[list[dict[str, torch.Tensor]]]:
    if dataset_name == "pileval_for_awq_benchmark":
        samples = get_pileval(tokenizer, num_calib_data, seqlen, device, seed=42)
    elif dataset_name == "wikitext_for_gptq_benchmark":
        samples = get_wikitext2(tokenizer, num_calib_data, seqlen, device)
    else:
        raise NotImplementedError

    calib_dataloader: DataLoader[list[dict[str, torch.Tensor]]] = DataLoader(samples, batch_size=None, shuffle=False)  # type: ignore

    return calib_dataloader


def get_calib_dataloader_to_tensor(
    dataset_name: str = "cnn_dailymail",
    tokenizer: AutoTokenizer = None,
    batch_size: int = 1,
    num_calib_data: int = 512,
    seqlen: int = 512,
    device: str | None = None,
) -> DataLoader[torch.Tensor]:
    if dataset_name == "pileval":
        dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation", cache_dir="data_cache")
        text_data = dataset["text"][:num_calib_data]
    elif dataset_name == "cnn_dailymail":
        dataset = load_dataset("cnn_dailymail", name="3.0.0", split="train", cache_dir="data_cache")
        text_data = dataset["article"][:num_calib_data]
    elif dataset_name == "wikitext":
        dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train", cache_dir="data_cache")
        text_data = dataset["text"][:num_calib_data]
    else:
        raise NotImplementedError

    batch_encoded = tokenizer(text_data, return_tensors="pt", padding=True, truncation=True, max_length=seqlen)
    if device:
        batch_encoded = batch_encoded.to(device)
    batch_encoded = batch_encoded["input_ids"]

    calib_dataloader = DataLoader(batch_encoded, batch_size=batch_size, shuffle=False)

    return calib_dataloader


def get_calib_dataloader_to_dict(
    dataset_name: str = "cnn_dailymail",
    tokenizer: AutoTokenizer = None,
    batch_size: int = 1,
    num_calib_data: int = 512,
    seqlen: int = 512,
    device: str | None = None,
) -> DataLoader[dict[str, torch.Tensor]]:
    def make_data_block(
        examples: dict[str, list[str]],
        tokenizer: AutoTokenizer = None,
        prompt_col_name: str = "",
        max_length: int = 512,
    ) -> dict[str, list[list[torch.Tensor]]]:
        res: dict[str, list[list[torch.Tensor]]] = tokenizer(
            examples[prompt_col_name], padding=True, truncation=True, max_length=max_length
        )
        return res

    def my_collate_fn(blocks: list[dict[str, list[list[str]]]]) -> dict[str, torch.Tensor]:
        data_batch = {}
        data_batch["input_ids"] = torch.Tensor([block["input_ids"] for block in blocks])
        if device:
            data_batch["input_ids"] = data_batch["input_ids"].to(device)
        return data_batch

    if dataset_name == "pileval":
        dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation", cache_dir="data_cache")
        prompt_col_name = "text"
    elif dataset_name == "cnn_dailymail":
        dataset = load_dataset("cnn_dailymail", name="3.0.0", split="train", cache_dir="data_cache")
        prompt_col_name = "article"
    elif dataset_name == "wikitext":
        dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train", cache_dir="data_cache")
        prompt_col_name = "text"
    else:
        raise NotImplementedError

    dataset = dataset.select(
        indices=[i for i in range(min(len(dataset), num_calib_data))],
        keep_in_memory=True,
    )
    tokenized_datasets = dataset.map(
        make_data_block,
        batched=True,
        batch_size=len(dataset),
        num_proc=1,
        remove_columns=dataset.column_names,
        keep_in_memory=True,
        fn_kwargs={"tokenizer": tokenizer, "prompt_col_name": prompt_col_name, "max_length": seqlen},
    )

    calib_dataloader = DataLoader(tokenized_datasets, batch_size=batch_size, collate_fn=my_collate_fn)

    return calib_dataloader

Let’s create a data reader to load the target dataset.

class CalibrationDataReader:
    def __init__(self, dataloader):
        super().__init__()
        self.iterator = iter(dataloader)

    # Returns one calibration batch as an ONNX input feed, or None when the data is exhausted.
    def get_next(self) -> dict | None:
        try:
            inputs = next(self.iterator)[0]
            input_dict = {}
            input_dict["input_ids"] = inputs.numpy().reshape(1, -1)
            input_dict["attention_mask"] = np.ones_like(inputs.numpy().reshape(1, -1))
            return input_dict
        except StopIteration:
            return None

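As an optional sanity check, the sketch below mirrors what the quantization step in the next section does: it wraps a tiny Pileval dataloader in the data reader and inspects one calibration batch. The tokenizer path assumes the opt-125m folder downloaded in step 2.

# Build a tiny calibration dataloader and fetch a single batch from the reader.
tokenizer = GPT2Tokenizer.from_pretrained("./opt-125m", do_lower_case=False)
reader = CalibrationDataReader(
    get_calib_dataloader("pileval", tokenizer=tokenizer, num_calib_data=4, seqlen=512, device="cpu")
)
batch = reader.get_next()
print(batch["input_ids"].shape, batch["attention_mask"].shape)
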
4) Quantization Procedure#

In this section, we compare three quantization configurations – an INT8 baseline, INT8 enhanced with GPTQ, and MatMulNBits (4-bit GPTQ weight quantization of MatMul operations in transformer-based models) – to illustrate how GPTQ improves accuracy retention during quantization. These configurations allow us to evaluate the trade-offs between model size, computational efficiency, and overall accuracy when deploying quantized models with AMD Quark.

def quantize_model(args: dict) -> None:
    # `input_model_path` is the path to the original, unquantized ONNX model.
    input_model_path = args["input_model_path"]

    # `output_model_path` is the path where the quantized model will be saved.
    output_model_path = args["output_model_path"]

    tokenizer_class = GPT2Tokenizer
    tokenizer = tokenizer_class.from_pretrained(
        os.path.dirname(input_model_path),
        do_lower_case=False,
        cache_dir=None,
    )

    # `dr` (Data Reader) is an instance of DataReader, which is a utility class that
    # reads the calibration dataset and prepares it for the quantization process.
    calib_dataloader = get_calib_dataloader(
        dataset_name="pileval", tokenizer=tokenizer, batch_size=1, seqlen=512, device=args["device"]
    )
    calib_dataloader = CalibrationDataReader(calib_dataloader)
    # Get quantization configuration
    quant_config = get_default_config(args["config"])
    config_copy = copy.deepcopy(quant_config)
    config_copy.extra_options["OpTypesToExcludeOutputQuantization"] = ["MatMul", "Gemm"]

    if args["config"] == "INT8_TRANSFORMER_DEFAULT":
        config_copy.extra_options["UseGPTQ"] = args["use_gptq"]
    elif args["config"] == "MATMUL_NBITS":
        config_copy.extra_options["MatMulNBitsParams"]["AccuracyLevel"] = 0
        config_copy.extra_options["MatMulNBitsParams"]["Algorithm"] = "GPTQ"
    config_copy.include_cle = False
    config_copy.extra_options["GPTQParams"] = {"MSE": False, "GroupSize": 128, "ActOrder": False, "PerChannel": True}
    config = Config(global_quant_config=config_copy)
    print(f"The configuration for quantization is {config}")

    # Create an ONNX quantizer
    quantizer = ModelQuantizer(config)

    # Quantize the ONNX model
    quantizer.quantize_model(input_model_path, output_model_path, calib_dataloader)

Create a dedicated folder for the INT8 baseline to prevent interference with the other quantization configurations. Then, define a base config for INT8 and apply quantization.

!rm -rf quantized_models
!cp -r models quantized_models
!rm -f quantized_models/model.onnx
quant_config_int8_only = {
    "input_model_path": "models/model.onnx",
    "output_model_path": "quantized_models/quantized_model.onnx",
    "use_gptq": False,
    "num_calib_data": 1000,
    "config": "INT8_TRANSFORMER_DEFAULT",
    "device": "cpu",
    "batch_size": 1,
    "workers": 1,
}
quantize_model(quant_config_int8_only)

Now try INT8 with GPTQ. Create a dedicated folder to prevent interference with the other quantization configurations. Then, define the GPTQ config and apply quantization.

!rm -rf gptq_quantized_models
!cp -r models gptq_quantized_models
!rm -f gptq_quantized_models/model.onnx
quant_config_int8_gptq = copy.deepcopy(quant_config_int8_only)
quant_config_int8_gptq["output_model_path"] = "gptq_quantized_models/gptq_quantized_model.onnx"
quant_config_int8_gptq["use_gptq"] = True

quantize_model(quant_config_int8_gptq)

Last, try MatMulNBits for INT4 with GPTQ. Again, create a dedicated folder to prevent interference with the other quantization configurations. Then, define the MatMulNBits config and apply quantization.

!rm -rf gptq_MATMUL_NBITS_quantized_models
!cp -r models gptq_MATMUL_NBITS_quantized_models
!rm -f gptq_MATMUL_NBITS_quantized_models/model.onnx
quant_config_matmul = copy.deepcopy(quant_config_int8_only)
quant_config_matmul["output_model_path"] = "gptq_MATMUL_NBITS_quantized_models/gptq_quantized_model.onnx"
quant_config_matmul["config"] = "MATMUL_NBITS"

quantize_model(quant_config_matmul)

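With all three quantized variants produced, you can compare their on-disk sizes against the float model. This small sketch only counts the main .onnx file for each model, using the output paths defined above.

import os

# Report the size of each ONNX file in megabytes.
for name, path in [
    ("Float", "models/model.onnx"),
    ("INT8", "quantized_models/quantized_model.onnx"),
    ("INT8 + GPTQ", "gptq_quantized_models/gptq_quantized_model.onnx"),
    ("MatMulNBits", "gptq_MATMUL_NBITS_quantized_models/gptq_quantized_model.onnx"),
]:
    print(f"{name}: {os.path.getsize(path) / 1e6:.0f} MB")
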
5) Evaluation and Expected Results#

Evaluation is performed on the WikiText-2 dataset. We compare four models: (1) the full-precision model, (2) the model quantized with INT8, (3) the model quantized with INT8 and GPTQ, and (4) the model quantized with MatMulNBits (INT4 with GPTQ). The full-precision model serves as the baseline for measuring any accuracy change caused by quantization.

The evaluation metric is perplexity, a standard measure of how well a language model predicts a sequence of words. It effectively captures how “surprised” the model is by the test data:

  • Low perplexity → the model predicts the text well (less surprised)

  • High perplexity → the model struggles to predict the text (more surprised)

You can think of perplexity as the “average branching factor”—how many choices the model is effectively considering at each prediction step.

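Concretely, perplexity is the exponential of the average per-token cross-entropy loss, which is exactly how the evaluation code below computes it. A tiny numeric illustration:

import torch

# Perplexity = exp(mean negative log-likelihood per token).
per_block_loss = torch.tensor([3.2, 3.4, 3.3])  # example cross-entropy losses
print(torch.exp(per_block_loss.mean()))  # tensor(27.1126), i.e. a perplexity of about 27
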
from transformers import OPTConfig, OPTForCausalLM, PreTrainedTokenizer

WEIGHTS_NAME = "pytorch_model.bin"
logger = logging.getLogger(__name__)

MODEL_CLASSES = {
    "opt": (OPTConfig, OPTForCausalLM, GPT2Tokenizer),
}


class TextDataset(Dataset):
    def __init__(self, tokenizer, args, block_size=512):
        testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
        text = ""
        for i in testdata:
            text += i["text"]
        self.examples = []
        tokenized_text = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(text))
        for i in range(0, len(tokenized_text) - block_size + 1, block_size):  # Truncate in block of block_size
            self.examples.append(tokenizer.build_inputs_with_special_tokens(tokenized_text[i : i + block_size]))

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, item):
        return torch.tensor(self.examples[item])


def load_and_cache_examples(args, tokenizer, evaluate=True):
    dataset = TextDataset(
        tokenizer,
        args,
        block_size=args["block_size"],
    )
    return dataset


def set_seed(args: dict) -> None:
    random.seed(args["seed"])
    np.random.seed(args["seed"])
    torch.manual_seed(args["seed"])


def mask_tokens(inputs: torch.Tensor, tokenizer: PreTrainedTokenizer, args) -> tuple[torch.Tensor, torch.Tensor]:
    """Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original."""
    labels = inputs.clone()
    probability_matrix = torch.full(labels.shape, args["mlm_probability"])
    special_tokens_mask = [
        tokenizer.get_special_tokens_mask(val, already_has_special_tokens=True) for val in labels.tolist()
    ]
    probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool()
    labels[~masked_indices] = -100  # We only compute loss on masked tokens

    # 80% of the time, we replace masked input tokens with tokenizer.mask_token ([MASK])
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool() & masked_indices
    inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

    # 10% of the time, we replace masked input tokens with random word
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool() & masked_indices & ~indices_replaced
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long)
    inputs[indices_random] = random_words[indices_random]

    # The rest of the time (10% of the time) we keep the masked input tokens unchanged
    return inputs, labels


def evaluate_onnx(args, model, tokenizer, prefix=""):
    from torch.nn import CrossEntropyLoss

    # Load the WikiText-2 test split.
    testdata = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
    test_data = ""
    for i in testdata:
        test_data += i["text"]

    eval_dataset = load_and_cache_examples(args, tokenizer, evaluate=True)

    args["eval_batch_size"] = args["per_gpu_eval_batch_size"]
    # Evaluate sequentially over the test blocks.
    eval_sampler = SequentialSampler(eval_dataset)
    eval_dataloader = DataLoader(eval_dataset, sampler=eval_sampler, batch_size=args["eval_batch_size"])

    logger.info(f"***** Running evaluation {prefix} *****")
    eval_loss = 0.0
    nb_eval_steps = 0

    for batch in tqdm(eval_dataloader, desc="Evaluating"):
        inputs, labels = (batch, batch)
        with torch.no_grad():
            outputs = model(input_ids=inputs, attention_mask=inputs.new_ones(inputs.shape))

            # Shift so that tokens < n predict n
            lm_logits = outputs[0]
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            lm_loss = loss_fct(shift_logits.float().view(-1, shift_logits.size(-1)), shift_labels.view(-1))

            eval_loss += lm_loss.mean().item()
        nb_eval_steps += 1

    eval_loss = eval_loss / nb_eval_steps
    perplexity = torch.exp(torch.tensor(eval_loss))

    result = {"perplexity": perplexity}

    for key in sorted(result.keys()):
        logger.info("  %s = %s", key, str(result[key]))

    return result


def evaluate(args: dict) -> dict:
    # Setup logging
    logging.basicConfig(
        format="%(asctime)s - %(levelname)s - %(name)s -   %(message)s",
        datefmt="%m/%d/%Y %H:%M:%S",
        level=logging.INFO,
    )

    # Set seed
    set_seed(args)

    # Load pretrained model and tokenizer
    _, _, tokenizer_class = MODEL_CLASSES[args["model_type"]]

    tokenizer = tokenizer_class.from_pretrained(
        args["tokenizer_name"] if args.get("tokenizer_name") else args["model_name_or_path"],
        do_lower_case=False,
        cache_dir=None,
    )
    tokenizer.add_bos_token = False
    if args["block_size"] <= 0:
        args["block_size"] = (
            tokenizer.max_len_single_sentence
        )  # Our input block size will be the max possible for the model

    # Evaluation
    results = {}

    if args["do_onnx_eval"]:
        logger.info("Evaluate the following onnx model: %s", args["model_name_or_path"])
        global_step = ""
        prefix = "onnx"

        from optimum.onnxruntime import ORTModelForCausalLM

        if args.get("no_cuda"):
            provider = "CPUExecutionProvider"
        else:
            provider = "CUDAExecutionProvider"
        model = ORTModelForCausalLM.from_pretrained(
            args["model_name_or_path"], provider=provider, use_cache=False, use_io_binding=False
        )
        result = evaluate_onnx(args, model, tokenizer, prefix=prefix)
        result = dict((k + f"_{global_step}", v) for k, v in result.items())
        results.update(result)

    return results

First, define an evaluation config and record the accuracy of the full-precision model.

eval_config = {
    "model_type": "opt",
    "mlm_probability": 0.15,
    "block_size": 2048,
    "per_gpu_eval_batch_size": 1,
    "no_cuda": True,
    "seed": 42,
    "do_onnx_eval": True,
    "eval_data_file": None,
    "config_name": "",
    "tokenizer_name": "",
}
full_precision_eval_config = copy.deepcopy(eval_config)
full_precision_eval_config["model_name_or_path"] = "models/"
full_precision_eval_config["onnx_model"] = "models/"

evaluate(full_precision_eval_config)

Then, specify the path to the INT8-only quantized model and record its accuracy.

int8_quant_eval_config = copy.deepcopy(eval_config)
int8_quant_eval_config["model_name_or_path"] = "quantized_models/"
int8_quant_eval_config["onnx_model"] = "quantized_models/"

evaluate(int8_quant_eval_config)

Third, specify the path to the INT8 + GPTQ quantized model and record its accuracy.

int8_gptq_eval_config = copy.deepcopy(eval_config)
int8_gptq_eval_config["model_name_or_path"] = "gptq_quantized_models/"
int8_gptq_eval_config["onnx_model"] = "gptq_quantized_models/"

evaluate(int8_gptq_eval_config)

Last, specify the path to the MatMulNBits quantized model and record its accuracy.

matmul_eval_config = copy.deepcopy(eval_config)
matmul_eval_config["model_name_or_path"] = "gptq_MATMUL_NBITS_quantized_models"
matmul_eval_config["onnx_model"] = "gptq_MATMUL_NBITS_quantized_models"

evaluate(matmul_eval_config)

The following table contains the expected results. Note that different machines can lead to minor variations in the accuracy of the quantized models.

|            | Float Model | INT8 Quantized Model | INT8 + GPTQ Quantized Model | MatMul Quantized Model |
|------------|-------------|----------------------|-----------------------------|------------------------|
| Model Size | 480 MB      | 384 MB               | 384 MB                      | 406 MB                 |
| Perplexity | 27.0317     | 28.6846              | 27.5734                     | 30.3604                |