Language Model QAT Using Quark and Trainer#
This document provides examples of Quantization-Aware Training (QAT) for language models using Quark.
Note
For information on accessing Quark PyTorch examples, refer to Accessing PyTorch Examples.
This example and the relevant files are available at `/torch/language_modeling/llm_qat`.
Supported Models#
| Model Name | WEIGHT-ONLY (INT4.g128) |
|---|---|
| microsoft/Phi-3-mini-4k-instruct | ✓ |
| THUDM/chatglm3-6b | ✓ |
Preparation#
Install the required packages before running QAT by executing `pip install -r requirements.txt`. To evaluate the model, install the evaluation dependencies by running `pip install -r ../llm_eval/requirements.txt`.
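For reference, a typical setup sequence might look like the following sketch (it assumes you start from the examples root mentioned in the note above):

```bash
# Install QAT and evaluation dependencies (run from the example directory).
cd examples/torch/language_modeling/llm_qat
pip install -r requirements.txt              # QAT requirements
pip install -r ../llm_eval/requirements.txt  # evaluation requirements
```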
If an NCCL timeout error occurs while saving the model, try pinning `accelerate==1.4.0` to resolve it.
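To apply this workaround:

```bash
pip install accelerate==1.4.0
```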
(Optional) Download the Hugging Face checkpoint for the LLM in advance.
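For example, assuming the `huggingface_hub` CLI is installed, the Phi-3 checkpoint could be fetched as shown below (the target directory is a hypothetical placeholder):

```bash
# Download the checkpoint to a local directory of your choice.
huggingface-cli download microsoft/Phi-3-mini-4k-instruct --local-dir ./models/Phi-3-mini-4k-instruct
```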
QAT Scripts#
You can run the following scripts from the `examples/torch/language_modeling/llm_qat` path; each recipe invokes `main.py`. Here, Phi-3-mini-4k-instruct is used as an example.
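The recipes below reference several shell variables that are assumed to be defined beforehand; the values in this sketch are hypothetical placeholders, not part of the shipped scripts:

```bash
# Hypothetical environment setup; adjust paths and names to your machine.
MODEL_DIR=./models/Phi-3-mini-4k-instruct    # local Hugging Face checkpoint
model_name=phi3_mini_4k_instruct             # tag used in log-file names
log_dir=./logs                               # where recipe logs are written
finetune_checkpoint=./finetune_checkpoint    # Trainer --output_dir
TOTAL_TIME=0                                 # accumulates elapsed seconds across recipes
mkdir -p ${log_dir}
```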
Recipe 1: QAT Finetuning ChatGLM and Export to Safetensors using FSDP#
SECONDS=0
log_file=${log_dir}/llm_qat_${model_name}_finetune.log
output_dir="./quantized_model/chatglm_6b"
NUM_GPUS=4
BATCH_SIZE_PER_GPU=2
TOTAL_BATCH_SIZE=32
# Accumulate gradients so NUM_GPUS x BATCH_SIZE_PER_GPU x GRADIENT_ACC_STEPS = TOTAL_BATCH_SIZE (32 / 4 / 2 = 4)
GRADIENT_ACC_STEPS=$(($TOTAL_BATCH_SIZE/$NUM_GPUS/$BATCH_SIZE_PER_GPU))
FSDP_CONFIG=./fsdp_configs/chatglm_fsdp_config.json
NUM_EPOCHS=5
LR=2e-5
MAX_SEQ_LEN=512
CUDA_VISIBLE_DEVICES=0,1,2,3 accelerate launch --num_processes=${NUM_GPUS} main.py \
--fsdp "full_shard auto_wrap" \
--fsdp_config ${FSDP_CONFIG} \
--model ${MODEL_DIR} \
--model_trust_remote_code \
--quant_scheme w_uint4_asym \
--group_size 128 \
--finetune_dataset wikitext \
--num_train_epochs ${NUM_EPOCHS} \
--learning_rate ${LR} \
--finetune_seqlen ${MAX_SEQ_LEN} \
--per_device_train_batch_size ${BATCH_SIZE_PER_GPU} \
--per_device_eval_batch_size ${BATCH_SIZE_PER_GPU} \
--model_export hf_format \
--output_dir $finetune_checkpoint \
--model_export_dir ${output_dir} \
--gradient_accumulation_steps ${GRADIENT_ACC_STEPS} \
--skip_evaluation 2>&1 | tee ${log_file}
date -ud "@$SECONDS" "+Time elapsed: %H:%M:%S" | tee -a ${log_file}
TOTAL_TIME=$((TOTAL_TIME+SECONDS))
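The `--fsdp_config` flag points to a JSON file consumed by the Hugging Face Trainer's FSDP integration; the actual file ships with the example under `fsdp_configs/`. A minimal hypothetical config might look like the snippet below (key names follow the Trainer's `fsdp_config` schema, and `GLMBlock` is assumed to be ChatGLM's transformer layer class):

```json
{
    "transformer_layer_cls_to_wrap": ["GLMBlock"],
    "backward_prefetch": "backward_pre"
}
```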
Recipe 2: Reload and Evaluate QAT Finetuned Model#
SECONDS=0
log_file=${log_dir}/llm_qat_${model_name}_test_finetuned.log
EVAL_BATCH=4
export CUDA_VISIBLE_DEVICES=5
EVAL_TASK=wikitext,winogrande,mmlu
# ${EVAL_TASK//,/_} replaces commas with underscores for the output path
EVAL_OUTPUT_PATH=./${model_name}_${EVAL_TASK//,/_}_quantized_eval_results
python main.py \
--model ${MODEL_DIR} \
--output_dir $finetune_checkpoint \
--model_trust_remote_code \
--skip_finetune \
--model_reload \
--import_model_dir $output_dir \
--eval_result_output_path ${EVAL_OUTPUT_PATH} \
--per_device_eval_batch_size ${EVAL_BATCH} \
--eval_task ${EVAL_TASK} 2>&1 | tee ${log_file}
date -ud "@$SECONDS" "+Time elapsed: %H:%M:%S" | tee -a ${log_file}
TOTAL_TIME=$((TOTAL_TIME+SECONDS))
Recipe 3: Evaluate Original Non-Quantized Model#
EVAL_TASK=wikitext,winogrande,mmlu
EVAL_OUTPUT_PATH=./${model_name}_${EVAL_TASK//,/_}_non_quantized_eval_results
SECONDS=0
EVAL_BATCH=4
log_file=${log_dir}/llm_qat_${model_name}_test_bf16.log
export CUDA_VISIBLE_DEVICES=4
python main.py \
--model ${MODEL_DIR} \
--output_dir $finetune_checkpoint \
--model_trust_remote_code \
--skip_quantization \
--skip_finetune \
--eval_result_output_path ${EVAL_OUTPUT_PATH} \
--per_device_eval_batch_size ${EVAL_BATCH} \
--eval_task ${EVAL_TASK} 2>&1 | tee ${log_file}
date -ud "@$SECONDS" "+Time elapsed: %H:%M:%S" | tee -a ${log_file}
TOTAL_TIME=$((TOTAL_TIME+SECONDS))
Results on Phi-3-mini-4k-instruct#
| Model Name | Wikitext PPL (Quark) | Wikitext PPL (LLM harness) | MMLU | Winogrande |
|---|---|---|---|---|
| BF16 | 6.19 | 10.32 | 68.59 | 74.42 |
| QAT Trainer | 6.21 | 11.51 | 65.97 | 73.24 |
Results on ChatGLM3-6B#
| Model Name | Wikitext PPL (Quark) | Wikitext PPL (LLM harness) | MMLU | Winogrande |
|---|---|---|---|---|
| BF16 | 29.93 | 51.30 | 50.45 | 62.35 |
| QAT Trainer | 9.84 | 29.97 | 49.36 | 65.50 |