Pruning#

Note

For information on accessing Quark PyTorch examples, refer to Accessing PyTorch Examples. This example and the relevant files are available at /torch/language_modeling/llm_pruning.

This topic contains examples of pruning language models (such as OPT and Llama) using Quark.

Supported Models#

Model Name                            Model Size  Pruning Rate  Pruned Model Size  PPL on Wiki2 (Before)  PPL on Wiki2 (After)
------------------------------------  ----------  ------------  -----------------  ---------------------  --------------------
mistralai/Mixtral-8x7B-Instruct-v0.1  46.7B       9.4838%       42.2B              4.1370                 5.1195
CohereForAI/c4ai-command-r-08-2024    32.3B       7.4025%       29.9B              4.5081                 6.3794
Qwen/Qwen2.5-14B-Instruct             14.8B       7.0284%       13.7B              5.6986                 7.5994
meta-llama/Meta-Llama-3-8B            8.0B        6.8945%       7.5B               6.1382                 8.0755
meta-llama/Llama-2-7b-hf              6.7B        6.7224%       6.2B               5.4721                 6.2462
facebook/opt-6.7b                     6.7B        7.5651%       6.2B               10.8602                11.8958
THUDM/chatglm3-6b                     6.2B        7.7590%       5.6B               29.9560                36.0010
microsoft/Phi-3.5-mini-instruct      3.8B        5.9274%       3.6B               6.1959                 7.8074
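
The pruned model size follows directly from the pruning rate: pruned size ≈ original size × (1 − pruning rate). This relation is an inference from the table, not a formula stated by Quark; a quick sanity check (the listed sizes are rounded, so the last digit may differ):

python3 -c "print(f'{6.7 * (1 - 0.067224):.4f}B')"   # prints 6.2496B, matching the 6.2B listed for Llama-2-7b-hf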

Preparation#

For Llama2 models, download the Hugging Face Llama2 checkpoint. Access to the Llama2 checkpoints requires submitting a permission request to Meta; for details, see the Llama2 page on Hugging Face. Once permission is granted, download the checkpoint to the [llama2_checkpoint_folder].
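
One way to fetch the gated checkpoint is via the huggingface-cli tool; this is a sketch, not the only method, and the target directory is the placeholder from above:

# Log in with a Hugging Face token that has been granted Llama2 access,
# then download the checkpoint to the local folder.
huggingface-cli login
huggingface-cli download meta-llama/Llama-2-7b-hf --local-dir [llama2_checkpoint_folder]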

Pruning Scripts#

Run the following Python scripts from the examples/torch/language_modeling/llm_pruning directory. The recipes below use Llama2-7b as an example.

Note

  • To avoid running out of GPU memory, add the --multi_gpu argument to run the model across multiple GPUs (see the example after this list).

  • CPU users should add the --device cpu argument.
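
For example, the evaluation recipe below can be run across multiple GPUs or on the CPU by appending the corresponding flag:

# Multi-GPU run:
python3 main.py --model_dir [llama2 checkpoint folder] \
                --skip_pruning \
                --multi_gpu

# CPU-only run:
python3 main.py --model_dir [llama2 checkpoint folder] \
                --skip_pruning \
                --device cpu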

Recipe 1: Evaluation of the Llama2 Float16 Model without Pruning#

python3 main.py --model_dir [llama2 checkpoint folder] \
                --skip_pruning

Recipe 2: Pruning the Model and Saving to Safetensors#

python3 main.py --model_dir [llama2 checkpoint folder] \
                --pruning_algo "osscar" \
                --num_calib_data 128 \
                --save_pruned_model \
                --save_dir save_dir
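
After the run, the --save_dir path contains the pruned weights in safetensors format. Assuming the checkpoint is saved in a standard Hugging Face directory layout (an assumption; verify against the example's actual output), it could be reloaded for inspection along these lines:

python3 - <<'EOF'
# Hypothetical reload of the pruned checkpoint; assumes a standard
# Hugging Face layout under save_dir (not confirmed by the Quark docs).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("save_dir")
print(sum(p.numel() for p in model.parameters()))  # parameter count after pruning
EOF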