Language Model Quantization Using Quark#
This document provides examples of quantizing and exporting language models (OPT, Llama, etc.) using Quark.
Supported Models#
| Model Name | FP8① | INT② | MX③ | AWQ/GPTQ(INT)④ | SmoothQuant | Rotation |
|---|---|---|---|---|---|---|
| meta-llama/Llama-2-*-hf ⑤ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| meta-llama/Llama-3-*-hf | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| meta-llama/Llama-3.1-*-hf | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| facebook/opt-* | ✓ | ✓ | ✓ | ✓ | ✓ | |
| EleutherAI/gpt-j-6b | ✓ | ✓ | ✓ | ✓ | ✓ | |
| THUDM/chatglm3-6b | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Qwen/Qwen-* | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Qwen/Qwen1.5-* | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Qwen/Qwen1.5-MoE-A2.7B | ✓ | ✓ | ✓ | ✓ | ✓ | |
| Qwen/Qwen2-* | ✓ | ✓ | ✓ | ✓ | ✓ | |
| microsoft/phi-2 | ✓ | ✓ | ✓ | ✓ | ✓ | |
| microsoft/Phi-3-mini-*k-instruct | ✓ | ✓ | ✓ | ✓ | ✓ | |
| microsoft/Phi-3.5-mini-instruct | ✓ | ✓ | ✓ | ✓ | ✓ | |
| mistralai/Mistral-7B-v0.1 | ✓ | ✓ | ✓ | ✓ | ✓ | |
| mistralai/Mixtral-8x7B-v0.1 | ✓ | ✓ | | | | |
| hpcai-tech/grok-1 | ✓ | ✓ | ✓ | | | |
| CohereForAI/c4ai-command-r-plus-08-2024 | ✓ | | | | | |
| CohereForAI/c4ai-command-r-08-2024 | ✓ | | | | | |
| CohereForAI/c4ai-command-r-plus | ✓ | | | | | |
| CohereForAI/c4ai-command-r-v01 | ✓ | | | | | |
| databricks/dbrx-instruct | ✓ | | | | | |
| deepseek-ai/deepseek-moe-16b-chat | ✓ | | | | | |
Note

① FP8 means OCP fp8_e4m3 data type quantization.
② INT includes INT8, UINT8, INT4, UINT4 data type quantization.
③ MX includes the OCP data types MXINT8, MXFP8E4M3, MXFP8E5M2, MXFP4, MXFP6E3M2, MXFP6E2M3.
④ GPTQ only supports QuantScheme 'PerGroup' and 'PerChannel'.
⑤ `*` represents different model sizes, such as 7b.
Preparation#
Getting example code (For users reading from documentation)

Users can get the example code after downloading and unzipping quark.zip (refer to the Installation Guide). The example folder is in quark.zip.

Directory structure:

```
quark.zip
└── example/torch/language_modeling
    ├── quantize_quark.py            # Main script for this example.
    ├── data_preparation.py          # Prepares the calibration dataset.
    └── configuration_preparation.py # Prepares quantization and export configurations.
```
Downloading the pre-trained floating-point model checkpoint (Optional)

Some models cannot be accessed directly. For Llama models, download the HF Llama checkpoint; the Llama checkpoints can be accessed by submitting a permission request to Meta. For additional details, see the Llama page on Hugging Face. Upon obtaining permission, download the checkpoint to `[llama2_checkpoint_folder_path]`.
Environment Preparation

If your environment has a transformers version below 4.44.0, update it to version 4.44.0 or higher.
Quantization & Export Scripts#

Note:

- To avoid memory limitations, GPU users can add the `--multi_gpu` argument when running the model on multiple GPUs.
- CPU users should add the `--device cpu` argument.
- For models in the supported list, users can add the `--pre_quantization_optimization smoothquant` argument to improve the accuracy of the quantized model in some cases.
- For Llama models, users can also add the `--pre_quantization_optimization rotation` argument to improve the accuracy of the quantized model in some cases.
Recipe 1: Evaluation of pre-trained float16 model without quantization

```shell
python3 quantize_quark.py --model_dir [model name or checkpoint folder path] \
    --skip_quantization
```
Recipe 2: FP8 (OCP fp8_e4m3) Quantization & JSON_SafeTensors_Export with KV cache

```shell
python3 quantize_quark.py --model_dir [model name or checkpoint folder path] \
    --output_dir output_dir \
    --quant_scheme w_fp8_a_fp8 \
    --kv_cache_dtype fp8 \
    --num_calib_data 128 \
    --no_weight_matrix_merge \
    --model_export quark_safetensors
```
Recipe 3: INT Weight-Only Quantization & JSON_SafeTensors_Export with AWQ

```shell
python3 quantize_quark.py --model_dir [model name or checkpoint folder path] \
    --output_dir output_dir \
    --quant_scheme w_int4_per_group_sym \
    --num_calib_data 128 \
    --quant_algo awq \
    --dataset pileval_for_awq_benchmark \
    --seq_len 512 \
    --model_export quark_safetensors
```
Recipe 4: INT Static Quantization & JSON_SafeTensors_Export

```shell
python3 quantize_quark.py --model_dir [model name or checkpoint folder path] \
    --output_dir output_dir \
    --quant_scheme w_int8_a_int8_per_tensor_sym \
    --num_calib_data 128 \
    --no_weight_matrix_merge \
    --model_export quark_safetensors
```
Recipe 5: Quantization & GGUF_Export with AWQ (W_uint4 A_float16 per_group asymmetric)

```shell
python3 quantize_quark.py --model_dir [model name or checkpoint folder path] \
    --output_dir output_dir \
    --quant_scheme w_uint4_per_group_asym \
    --quant_algo awq \
    --num_calib_data 128 \
    --group_size 32 \
    --model_export gguf
```
If the code runs successfully, it produces one .gguf file in `[output_dir]`, and the terminal displays `GGUF quantized model exported to ... successfully`.
Recipe 6: MX Quantization

Quark now supports the microscaling data types, abbreviated as MX. Use the following command to quantize a model to an MX data type:

```shell
python3 quantize_quark.py --model_dir [model name or checkpoint folder path] \
    --output_dir output_dir \
    --quant_scheme w_mx_fp8 \
    --num_calib_data 32 \
    --group_size 32
```
The command above performs weight-only quantization. If you want activations to be quantized as well, use the command below:

```shell
python3 quantize_quark.py --model_dir [model name or checkpoint folder path] \
    --output_dir output_dir \
    --quant_scheme w_mx_fp8_a_mx_fp8 \
    --num_calib_data 32 \
    --group_size 32
```
Recipe 7: BFP16 Quantization

Quark now supports the BFP16 data type, short for 16-bit block floating point. Use the following command to quantize a model to BFP16:

```shell
python3 quantize_quark.py --model_dir [model name or checkpoint folder path] \
    --output_dir output_dir \
    --quant_scheme w_bfp16 \
    --num_calib_data 16
```
The command above performs weight-only quantization. If you want activations to be quantized as well, use the command below:

```shell
python3 quantize_quark.py --model_dir [model name or checkpoint folder path] \
    --output_dir output_dir \
    --quant_scheme w_bfp16_a_bfp16 \
    --num_calib_data 16
```
Tutorial: Running a Model Not on the Supported List#
For a new model that is not on the supported list, you need to modify some relevant files. Follow these steps:

- Step 1: Add the model type to `MODEL_NAME_PATTERN_MAP` in the `get_model_type` function in quantize_quark.py.
- Step 2: Customize the `tokenizer` for your model in the `get_tokenizer` function in quantize_quark.py.
- Step 3: [Optional] For layers you do not want to quantize, add them to `MODEL_NAME_EXCLUDE_LAYERS_MAP` in configuration_preparation.py.
- Step 4: [Optional] If quantizing the `kv_cache`, add the names of the KV layers to `MODEL_NAME_KV_LAYERS_MAP` in configuration_preparation.py.
- Step 5: [Optional] If using GPTQ, SmoothQuant, or AWQ, add `awq_config.json` and `gptq_config.json` for the model.
Step 1: Add the model type to `MODEL_NAME_PATTERN_MAP` in the `get_model_type` function in quantize_quark.py#
`MODEL_NAME_PATTERN_MAP` describes the model type, which is used to configure the `quant_config` for the model. Use part of the model's HF ID as the dictionary key, and the lowercase version of that key as the value. For `CohereForAI/c4ai-command-r-v01`, you can add `{"Cohere": "cohere"}` to `MODEL_NAME_PATTERN_MAP`.
```python
def get_model_type(model: nn.Module) -> str:
    MODEL_NAME_PATTERN_MAP = {
        "Llama": "llama",
        "OPT": "opt",
        # ...
        "Cohere": "cohere",  # <---- Add code HERE
    }
    for k, v in MODEL_NAME_PATTERN_MAP.items():
        if k.lower() in type(model).__name__.lower():
            return v
```
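The lookup above is a case-insensitive substring match of each key against the model's class name. A minimal self-contained sketch of this behavior follows; the `CohereForCausalLM` class here is a dummy stand-in for the real transformers class, and the `ValueError` fallback for unmatched models is an assumption, not the example's actual code.

```python
MODEL_NAME_PATTERN_MAP = {"Llama": "llama", "OPT": "opt", "Cohere": "cohere"}

class CohereForCausalLM:
    """Dummy stand-in for the real transformers model class."""
    pass

def get_model_type(model) -> str:
    # Case-insensitive substring match of each key against the class name.
    for k, v in MODEL_NAME_PATTERN_MAP.items():
        if k.lower() in type(model).__name__.lower():
            return v
    raise ValueError(f"Unsupported model type: {type(model).__name__}")

print(get_model_type(CohereForCausalLM()))  # -> cohere
```

Because "Cohere" is a substring of the class name `CohereForCausalLM`, adding that single key is enough to cover every Cohere model class.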
Step 2: Customize the `tokenizer` for your model in the `get_tokenizer` function in quantize_quark.py#
For most models, the `get_tokenizer` function is applicable as-is. But for some models, such as `CohereForAI/c4ai-command-r-v01`, `use_fast` can only be set to `True` (as of `transformers-4.44.2`). You can customize the tokenizer by referring to your model's model card on Hugging Face and to tokenization_auto.py in transformers.
```python
def get_tokenizer(ckpt_path: str, max_seq_len: int = 2048, model_type: Optional[str] = None) -> AutoTokenizer:
    print(f"Initializing tokenizer from {ckpt_path}")
    use_fast = True if model_type == "grok" or model_type == "cohere" else False
    tokenizer = AutoTokenizer.from_pretrained(ckpt_path,
                                              model_max_length=max_seq_len,
                                              padding_side="left",
                                              trust_remote_code=True,
                                              use_fast=use_fast)
    return tokenizer
```
Step 3: [Optional] For some layers you don't want to quantize, add them to `MODEL_NAME_EXCLUDE_LAYERS_MAP` in configuration_preparation.py#
Normally, when quantizing a MoE model, the `gate` layers do not need to be quantized. If there are such layers, or any other layers you do not want to quantize, add the `model_type` and the excluded layer names to `MODEL_NAME_EXCLUDE_LAYERS_MAP`. You can add the full name of a layer, or part of the name with wildcards. For `dbrx-instruct`, you can add `"dbrx": ["lm_head", "*router.layer"]` to `MODEL_NAME_EXCLUDE_LAYERS_MAP`. Note that `lm_head` is excluded by default.
```python
MODEL_NAME_EXCLUDE_LAYERS_MAP = {
    "llama": ["lm_head"],
    "opt": ["lm_head"],
    # ...
    "cohere": ["lm_head"],  # <---- Add code HERE
}
```
Step 4: [Optional] If quantizing the `kv_cache`, you must add the names of the KV layers to `MODEL_NAME_KV_LAYERS_MAP` in configuration_preparation.py#
When quantizing the `kv_cache`, you must add the `model_type` and the KV layer names to `MODEL_NAME_KV_LAYERS_MAP`. For `facebook/opt-125m`, the full name of `k_proj` is `model.model.decoder.layer[0].self_attn.k_proj` (similarly for `v_proj`), so add the names with wildcards: `"opt": ["*k_proj", "*v_proj"]`. For `chatglm`, you can add `"chatglm": ["*query_key_value"]`.
```python
MODEL_NAME_KV_LAYERS_MAP = {
    "llama": ["*k_proj", "*v_proj"],
    "opt": ["*k_proj", "*v_proj"],
    # ...
    "cohere": ["*k_proj", "*v_proj"],  # <---- Add code HERE
}
```
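The patterns are shell-style wildcards, so you can sanity-check that a pattern covers the layer names you intend with Python's standard `fnmatch` module. This is only a checking aid; whether Quark matches with exactly `fnmatch` semantics is an assumption.

```python
from fnmatch import fnmatch

# Full module name taken from the facebook/opt-125m example above.
layer_name = "model.model.decoder.layer[0].self_attn.k_proj"

# Patterns as they would appear in MODEL_NAME_KV_LAYERS_MAP["opt"].
patterns = ["*k_proj", "*v_proj"]

matched = [p for p in patterns if fnmatch(layer_name, p)]
print(matched)  # -> ['*k_proj']
```

A leading `*` matches any prefix, so one pattern covers the corresponding projection layer in every decoder block.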
Step 5: [Optional] If using GPTQ, SmoothQuant, or AWQ, add `awq_config.json` and `gptq_config.json` for the model#
Quark relies on `awq_config.json` and `gptq_config.json` to execute GPTQ, SmoothQuant, and AWQ. Therefore, you must create a directory named after the `model_type` from Step 1 under `Quark/examples/torch/language_modeling/models`, and create `awq_config.json` and `gptq_config.json` in this directory. Taking the `meta-llama/Llama-2-7b` model as an example, create a directory named `llama` in `Quark/examples/torch/language_modeling/models`, and create `awq_config.json` and `gptq_config.json` in `Quark/examples/torch/language_modeling/models/llama`.
For GPTQ#
The config file must be named `gptq_config.json`. Collect all the linear layers inside the decoder layers and put them in the `inside_layer_modules` list, and put the decoder layers' name in the `model_decoder_layers` list. You can refer to `Quark/examples/torch/language_modeling/models/*/gptq_config.json` and find the configuration of a model with a structure similar to your model's.
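As an illustration only, a `gptq_config.json` for a Llama-style decoder might look like the sketch below. The two keys (`model_decoder_layers`, `inside_layer_modules`) are the ones named above, but the specific layer names and the exact value format are assumptions based on the standard Llama module layout; treat an existing `Quark/examples/torch/language_modeling/models/*/gptq_config.json` as the authoritative reference.

```json
{
  "model_decoder_layers": ["model.layers"],
  "inside_layer_modules": [
    "self_attn.q_proj",
    "self_attn.k_proj",
    "self_attn.v_proj",
    "self_attn.o_proj",
    "mlp.gate_proj",
    "mlp.up_proj",
    "mlp.down_proj"
  ]
}
```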
For SmoothQuant and AWQ#
SmoothQuant and AWQ use the same file, named `awq_config.json`. In general, for each decoder layer you need to process four parts (linear_qkv, linear_o, linear_mlp_fc1, linear_mlp_fc2). For each part, provide the previous adjacent layer (`prev_op`), the input layer (`inp`), and the layer to inspect (`module2inspect`). If a condition must hold before inspecting, use `condition` to check it; `help` is optional and can provide additional information. Additionally, when quantizing a model with GQA, `num_attention_heads` and `num_key_value_heads` should be added to `awq_config.json`, and `alpha` should be set specifically to `0.85`, which influences how aggressively weights are quantized. Finally, put the decoder layers' name in `model_decoder_layers`. You can refer to `Quark/examples/torch/language_modeling/models/*/awq_config.json` and find the configuration of a model with a structure similar to your model's. For example, models containing the GQA structure can refer to `Quark/examples/torch/language_modeling/models/qwen2moe/awq_config.json`, and those containing the MoE structure can refer to `Quark/examples/torch/language_modeling/models/grok/awq_config.json`.
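For orientation only, a heavily simplified `awq_config.json` sketch for one Llama-style decoder layer is shown below. The per-entry keys (`prev_op`, `inp`, `module2inspect`, `help`) and `model_decoder_layers` come from the description above, but the top-level `scaling_layers` grouping key, the layer names, and the value format are hypothetical; copy the structure from an existing `Quark/examples/torch/language_modeling/models/*/awq_config.json` rather than from this sketch.

```json
{
  "model_decoder_layers": "model.layers",
  "scaling_layers": [
    {"prev_op": "input_layernorm", "inp": "self_attn.q_proj",
     "module2inspect": "self_attn", "help": "linear_qkv"},
    {"prev_op": "self_attn.v_proj", "inp": "self_attn.o_proj",
     "help": "linear_o"},
    {"prev_op": "post_attention_layernorm", "inp": "mlp.gate_proj",
     "module2inspect": "mlp", "help": "linear_mlp_fc1"},
    {"prev_op": "mlp.up_proj", "inp": "mlp.down_proj",
     "help": "linear_mlp_fc2"}
  ]
}
```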
Tutorial: Generating AWQ Configuration Automatically (Experimental)#
We provide a script, awq_auto_config_helper.py, that simplifies these steps by using torch.compile to quickly identify the modules in a model that are compatible with the "AWQ" and "SmoothQuant" algorithms.
Installation#
This script requires PyTorch version 2.4 or higher.
Usage#
Set the MODEL_DIR variable to the model name from Hugging Face, such as `facebook/opt-125m`, `Qwen/Qwen2-0.5B`, or `EleutherAI/gpt-j-6b`. To run the script, use the following command:

```shell
MODEL_DIR="your_model"
python awq_auto_config_helper.py --model_dir "${MODEL_DIR}"
```