FP8 Quantization with Per-Channel Static Weights and Per-Token Dynamic Activations#
Installation#
To get started, install:
pip install amd-quark transformers
After following the steps below, the exported model Qwen1.5-0.5B-FP8-ptpc is ready to be loaded into vLLM (see the loading sketch after step 4).
Code Overview#
Typically, quantizing a floating-point model with AMD Quark involves the following steps:
1) Load the original floating-point model#
from transformers import AutoModelForCausalLM, AutoTokenizer
ckpt_path = "Qwen/Qwen1.5-0.5B"
model = AutoModelForCausalLM.from_pretrained(ckpt_path, torch_dtype="auto")
model.eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
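Before quantizing, it can be useful to run a quick generation with the original model so you have a baseline to compare against later. A minimal sanity check (the prompt is arbitrary and not part of the Quark workflow):
import torch

# Generate a short continuation with the unquantized model as a baseline.
inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
with torch.no_grad():
    baseline = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(baseline[0], skip_special_tokens=True))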
2) Set the quantization configuration#
from quark.torch import LLMTemplate
# Prefer the model_type key; fall back to the architecture name.
model_config_type = (
    model.config.model_type
    if hasattr(model.config, "model_type")
    else model.config.architectures[0]
)
template = LLMTemplate.get(model_config_type)
quant_config = template.get_config(scheme="ptpc_fp8", exclude_layers=["lm_head"])
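The ptpc_fp8 scheme encodes the two granularities in this tutorial's title: each weight matrix gets one static FP8 scale per output channel, while each activation row gets one scale per token, computed dynamically at runtime. The following standalone sketch illustrates the scale computation conceptually; it is not Quark's internal implementation, and 448.0 is the largest finite value of the FP8 E4M3 format:
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def per_channel_weight_scales(weight: torch.Tensor) -> torch.Tensor:
    # Static: one scale per output channel (row), computed once from the weights.
    return weight.abs().amax(dim=1) / FP8_E4M3_MAX

def per_token_activation_scales(x: torch.Tensor) -> torch.Tensor:
    # Dynamic: one scale per token, recomputed from the live activations.
    return x.abs().amax(dim=-1) / FP8_E4M3_MAX

w = torch.randn(2048, 1024)   # [out_features, in_features]
x = torch.randn(4, 1024)      # [num_tokens, hidden_size]
print(per_channel_weight_scales(w).shape)    # torch.Size([2048])
print(per_token_activation_scales(x).shape)  # torch.Size([4])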
3) Use the AMD Quark API to perform an in-place replacement of the model’s modules with quantized modules#
from quark.torch import ModelQuantizer
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)
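The replacement happens in place, so the returned model has its linear layers swapped for quantized wrappers. You can confirm this by inspecting a layer; the exact wrapper class name depends on the installed Quark version, so treat the output as illustrative:
# Print the type of one attention projection to confirm the in-place swap.
for name, module in model.named_modules():
    if name.endswith("q_proj"):
        print(name, "->", type(module).__name__)
        break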
4) (Optional) Export the quantized model#
from quark.torch import export_safetensors
output_dir = ckpt_path.rstrip("/").split("/")[-1] + "-FP8-ptpc"
model = quantizer.freeze(model)
export_safetensors(model, output_dir)
tokenizer.save_pretrained(output_dir)
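At this point the exported checkpoint can be loaded into vLLM, as noted in the introduction. A minimal offline-inference sketch, assuming vLLM is installed and picks up the quantization settings from the exported configuration:
from vllm import LLM, SamplingParams

# vLLM should detect the FP8 quantization from the exported config files.
llm = LLM(model="./Qwen1.5-0.5B-FP8-ptpc")
params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)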
5) (Optional) Evaluate accuracy#
Install vllm and lm-evaluation-harness:
pip install vllm lm_eval
Evaluate accuracy with lm_eval (for example, on 200 samples of gsm8k):
lm_eval \
--model vllm \
--model_args pretrained=./Qwen1.5-0.5B-FP8-ptpc,add_bos_token=True \
--tasks gsm8k --num_fewshot 5 --batch_size auto --limit 200