Language Model Quantization using Quark
This document provides examples of quantizing and exporting language models (OPT, Llama, …) using Quark. We offer several scripts for various quantization strategies. Users who wish to apply their own custom settings for the calibration method, dataset, or quantization configuration can refer to the User Guide for help.
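As a point of reference, a calibration dataset is usually just a list of tokenized text batches that are passed through the model during calibration. The sketch below shows one way such batches could be prepared with the Hugging Face datasets and transformers libraries; the dataset choice, sample count, and sequence length are illustrative assumptions, not Quark requirements.

```python
# Hedged sketch: building calibration batches from wikitext.
# The dataset, number of samples, and sequence length are illustrative choices.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

def build_calib_batches(model_dir, num_samples=128, seq_len=512):
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    texts = [t for t in dataset["text"] if len(t.strip()) > 0][:num_samples]
    batches = []
    for text in texts:
        ids = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=seq_len).input_ids
        batches.append(ids)
    return batches
```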
Models supported
| Model Family | Model Name | FP16 | BFP16 | FP8 (+FP8_KV_CACHE) | W_UINT4 (per group) + A_BF16 (+AWQ/GPTQ) | INT8 | SmoothQuant | FP8 SafeTensors Export | INT8 ONNX Export |
|---|---|---|---|---|---|---|---|---|---|
| LLAMA 2 | meta-llama/Llama-2-*-hf | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| LLAMA 3 | meta-llama/Llama-3-*-hf | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| OPT | facebook/opt-* | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ |
| Qwen 1.5 | Qwen/Qwen1.5-* | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ |
Note: * represents different model sizes, such as 7b.
Preparation
For Llama2 models, download the HF Llama2 checkpoint. The Llama2 model checkpoints can be accessed by submitting a permission request to Meta. For additional details, see the Llama2 page on Hugging Face. Upon obtaining permission, download the checkpoint to [llama2_checkpoint_folder].
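Assuming access has been granted, the checkpoint can be fetched with the huggingface_hub client; the local folder name below is only a placeholder for [llama2_checkpoint_folder].

```python
# Hedged sketch: downloading the gated Llama-2 checkpoint once access is granted.
# Assumes you are logged in (e.g. via `huggingface-cli login`); the local folder
# name is a placeholder for [llama2_checkpoint_folder].
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Llama-2-7b-hf",
    local_dir="llama2_checkpoint_folder",
)
```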
Quantization & Export Scripts
You can run the following Python scripts from the examples/torch/language_modeling directory. Here we use Llama2-7b as an example.
Recipe 1: Evaluation of Llama2 float16 model without quantization
python3 quantize_quark.py --model_dir [llama2 checkpoint folder] \
--skip_quantization
Llama2-7b perplexity with wikitext dataset (on A100 GPU): 5.4720
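For reference, the perplexity numbers quoted on this page come from the standard strided wikitext-2 evaluation. A minimal sketch of that metric in plain transformers follows; the window size and float16 loading are illustrative assumptions and may differ from what quantize_quark.py does internally.

```python
# Hedged sketch: wikitext-2 perplexity via non-overlapping windows.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def wikitext_perplexity(model_dir, seq_len=2048, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = AutoModelForCausalLM.from_pretrained(
        model_dir, torch_dtype=torch.float16).to(device).eval()
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1",
                                    split="test")["text"])
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    nlls, n_tokens = [], 0
    for i in range(0, ids.size(1), seq_len):
        chunk = ids[:, i : i + seq_len]
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)            # HF shifts labels internally
        nlls.append(out.loss * (chunk.size(1) - 1)) # total NLL for this window
        n_tokens += chunk.size(1) - 1
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()
```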
Recipe 2: Quantization of Llama2 with AWQ (W_uint4 A_float16 per_group asymmetric)
python3 quantize_quark.py --model_dir [llama2 checkpoint folder] \
--output_dir output_dir \
--quant_scheme w_uint4_per_group_asym \
--num_calib_data 128 \
--quant_algo awq \
--dataset pileval_for_awq_benchmark \
--seq_len 512
Llama2-7b perplexity with wikitext dataset (on A100 GPU): 5.6168
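For intuition about the w_uint4_per_group_asym scheme: each group of weights gets its own scale and zero point and is rounded to 4-bit unsigned integers. Below is a minimal numeric sketch of that quantize/dequantize round trip; the group size of 128 is an illustrative assumption, and this is not Quark's implementation.

```python
# Hedged sketch: uint4 per-group asymmetric fake quantization of a weight tensor.
import torch

def fake_quant_uint4_per_group(w: torch.Tensor, group_size: int = 128):
    w_groups = w.reshape(-1, group_size)
    w_min = w_groups.amin(dim=1, keepdim=True)
    w_max = w_groups.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 15.0            # 4-bit range 0..15
    zero_point = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w_groups / scale) + zero_point, 0, 15)
    return ((q - zero_point) * scale).reshape(w.shape)        # dequantized weights

w = torch.randn(4096, 4096)
print((w - fake_quant_uint4_per_group(w)).abs().mean())       # mean quantization error
```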
Recipe 3: Quantization & vLLM_Adopt_SafeTensors_Export of Llama2 with W_int4 A_float16 per_group symmetric
python3 quantize_quark.py --model_dir [llama2 checkpoint folder] \
--output_dir output_dir \
--quant_scheme w_int4_per_group_sym \
--num_calib_data 128 \
--model_export vllm_adopted_safetensors
If the code runs successfully, it will produce one JSON file and one .safetensors file in [output_dir], and the terminal will display [Quark] Quantized model exported to ... successfully.
Llama2-7b perplexity with wikitext dataset (on A100 GPU): 5.7912
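To sanity-check the exported artifacts, the JSON file and the .safetensors file can be opened with standard tooling. The sketch below discovers the file names with glob rather than assuming them.

```python
# Hedged sketch: inspecting the exported JSON config and .safetensors weights.
import glob, json
from safetensors import safe_open

output_dir = "output_dir"
with open(glob.glob(f"{output_dir}/*.json")[0]) as f:
    config = json.load(f)
print(config.keys())

with safe_open(glob.glob(f"{output_dir}/*.safetensors")[0], framework="pt") as f:
    for name in list(f.keys())[:10]:            # first few tensor names and dtypes
        print(name, f.get_tensor(name).dtype)
```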
Recipe 4: Quantization & vLLM_Adopt_SafeTensors_Export of W_FP8_A_FP8 with FP8 KV cache
python3 quantize_quark.py --model_dir [llama2 checkpoint folder] \
--output_dir output_dir \
--quant_scheme w_fp8_a_fp8 \
--kv_cache_dtype fp8 \
--num_calib_data 128 \
--model_export vllm_adopted_safetensors
If the code runs successfully, it will produce one JSON file and one .safetensors file in [output_dir], and the terminal will display [Quark] Quantized model exported to ... successfully.
Llama2-7b perplexity with wikitext dataset (on A100 GPU): 5.5133
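For reference, FP8 quantization of weights, activations, and the KV cache scales each tensor into the representable e4m3 range before casting. A minimal per-tensor sketch in plain PyTorch (PyTorch ≥ 2.1) follows; it illustrates the arithmetic only, not Quark's code.

```python
# Hedged sketch: per-tensor FP8 (e4m3) fake quantization in PyTorch.
# 448.0 is the maximum finite value of torch.float8_e4m3fn.
import torch

def fake_quant_fp8_e4m3(x: torch.Tensor):
    scale = x.abs().amax() / 448.0               # per-tensor scaling factor
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)  # cast into e4m3
    return x_fp8.to(x.dtype) * scale             # dequantize for comparison

x = torch.randn(1024, 1024) * 3
print((x - fake_quant_fp8_e4m3(x)).abs().max())  # worst-case quantization error
```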
Recipe 5: Quantization & vLLM_Adopt_SafeTensors_Export of W_FP8_A_FP8 only
python3 quantize_quark.py --model_dir [llama2 checkpoint folder] \
--output_dir output_dir \
--quant_scheme w_fp8_a_fp8 \
--num_calib_data 128 \
--model_export vllm_adopted_safetensors
If the code runs successfully, it will produce one JSON file and one .safetensors file in [output_dir], and the terminal will display [Quark] Quantized model exported to ... successfully.
Llama2-7b perplexity with wikitext dataset (on A100 GPU): 5.5093
Recipe 6: Quantization & vLLM_Adopt_SafeTensors_Export of W_FP8_A_FP8_O_FP8
python3 quantize_quark.py --model_dir [llama2 checkpoint folder] \
--output_dir output_dir \
--quant_scheme w_fp8_a_fp8_o_fp8 \
--num_calib_data 128 \
--model_export vllm_adopted_safetensors
If the code runs successfully, it will produce one JSON file and one .safetensors file in [output_dir], and the terminal will display [Quark] Quantized model exported to ... successfully.
Llama2-7b perplexity with wikitext dataset (on A100 GPU): 5.5487
Recipe 7: Quantization & vLLM_Adopt_SafeTensors_Export of W_FP8_A_FP8_O_FP8 without merging the weight scaling factors of gate_proj and up_proj. If the option --no_weight_matrix_merge is not set, the weight scaling factors of gate_proj and up_proj are merged.
python3 quantize_quark.py --model_dir [llama2 checkpoint folder] \
--output_dir output_dir \
--quant_scheme w_fp8_a_fp8_o_fp8 \
--num_calib_data 128 \
--model_export vllm_adopted_safetensors \
--no_weight_matrix_merge
If the code runs successfully, it will produce one JSON file and one .safetensors file in [output_dir], and the terminal will display [Quark] Quantized model exported to ... successfully.
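Conceptually, "merging" means gate_proj and up_proj share a single weight scaling factor, which matters when an inference engine fuses the two projections into one gate_up_proj matrix. The sketch below illustrates that idea under the assumption that the shared scale is the larger of the two per-tensor scales; Quark's actual export logic may differ.

```python
# Hedged sketch of what merging the gate_proj/up_proj weight scales means:
# both projections end up sharing one FP8 scale (assumed here to be the max of
# the two per-tensor scales). Illustration only, not Quark's export code.
import torch

gate_w = torch.randn(11008, 4096)
up_w = torch.randn(11008, 4096)

gate_scale = gate_w.abs().amax() / 448.0
up_scale = up_w.abs().amax() / 448.0

merged_scale = torch.maximum(gate_scale, up_scale)  # shared scale for a fused gate_up_proj
print(gate_scale.item(), up_scale.item(), merged_scale.item())
```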
Recipe 8: Quantization & Torch compile of W_INT8_A_INT8_PER_TENSOR_SYM
python3 quantize_quark.py --model_dir [llama2 checkpoint folder] \
--output_dir output_dir \
--quant_scheme w_int8_a_int8_per_tensor_sym \
--num_calib_data 128 \
--device cpu \
--data_type bfloat16 \
--model_export torch_compile
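For intuition about the w_int8_a_int8_per_tensor_sym scheme: both weights and activations use a single symmetric scale per tensor with a zero point of 0. A minimal sketch of that arithmetic (not Quark's implementation):

```python
# Hedged sketch: symmetric per-tensor INT8 fake quantization, as used for both
# weights and activations in the w_int8_a_int8_per_tensor_sym scheme.
import torch

def fake_quant_int8_per_tensor_sym(x: torch.Tensor):
    scale = x.abs().amax() / 127.0                     # symmetric, zero point = 0
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q * scale                                   # dequantized values

x = torch.randn(2048, 2048)
print((x - fake_quant_int8_per_tensor_sym(x)).abs().mean())
```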