Pruning#

Note

For information on accessing Quark PyTorch examples, refer to Accessing PyTorch Examples. This example and the relevant files are available at /torch/language_modeling/llm_pruning.

This topic contains examples of pruning language models (such as OPT and Llama) using Quark.

Supported Models#

Model Name                            Model Size  Pruning Rate  Pruned Model Size  PPL on Wiki2 (Before)  PPL on Wiki2 (After)
------------------------------------  ----------  ------------  -----------------  ---------------------  --------------------
mistralai/Mixtral-8x7B-Instruct-v0.1  46.7B       9.4838%       42.2B              4.1370                 5.1195
CohereForAI/c4ai-command-r-08-2024    32.3B       7.4025%       29.9B              4.5081                 6.3794
Qwen/Qwen2.5-14B-Instruct             14.8B       7.0284%       13.7B              5.6986                 7.5994
meta-llama/Meta-Llama-3-8B            8.0B        6.8945%       7.5B               6.1382                 8.0755
meta-llama/Llama-2-7b-hf              6.7B        6.7224%       6.2B               5.4721                 6.2462
facebook/opt-6.7b                     6.7B        7.5651%       6.2B               10.8602                11.8958
THUDM/chatglm3-6b                     6.2B        7.7590%       5.6B               29.9560                36.0010
microsoft/Phi-3.5-mini-instruct      3.8B        5.9274%       3.6B               6.1959                 7.8074
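
The pruned model size follows directly from the pruning rate: pruned size ≈ original size × (1 − pruning rate). This relation is an inference from the table, not a formula stated by Quark; a quick sanity check (the listed sizes are rounded, so the last digit may differ):

python3 -c "print(f'{6.7 * (1 - 0.067224):.4f}B')"   # prints 6.2496B, matching the 6.2B listed for Llama-2-7b-hf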

Preparation#

For Llama2 models, download the Hugging Face Llama2 checkpoint. Access to the Llama2 checkpoints requires submitting a permission request to Meta; for details, see the Llama2 page on Hugging Face. Once permission is granted, download the checkpoint to the [llama2_checkpoint_folder].
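
One way to fetch the gated checkpoint is via the huggingface-cli tool; this is a sketch, not the only method, and the target directory is the placeholder from above:

# Log in with a Hugging Face token that has been granted Llama2 access,
# then download the checkpoint to the local folder.
huggingface-cli login
huggingface-cli download meta-llama/Llama-2-7b-hf --local-dir [llama2_checkpoint_folder]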

Pruning Scripts#

Run the following Python scripts from the examples/torch/language_modeling/llm_pruning directory. The recipes below use Llama2-7b as an example.

Note

  • To avoid running out of GPU memory, add the --multi_gpu argument to run the model across multiple GPUs (see the example after this list).

  • CPU users should add the --device cpu argument.
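
For example, the evaluation recipe below can be run across multiple GPUs or on the CPU by appending the corresponding flag:

# Multi-GPU run:
python3 main.py --model_dir [llama2 checkpoint folder] \
                --skip_pruning \
                --multi_gpu

# CPU-only run:
python3 main.py --model_dir [llama2 checkpoint folder] \
                --skip_pruning \
                --device cpu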

Recipe 1: Evaluation of the Llama2 Float16 Model without Pruning#

python3 main.py --model_dir [llama2 checkpoint folder] \
                --skip_pruning

Recipe 2: Pruning the Model and Saving to Safetensors#

python3 main.py --model_dir [llama2 checkpoint folder] \
                --pruning_algo "osscar" \
                --num_calib_data 128 \
                --save_pruned_model \
                --save_dir save_dir
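
After the run, the --save_dir path contains the pruned weights in safetensors format. Assuming the checkpoint is saved in a standard Hugging Face directory layout (an assumption; verify against the example's actual output), it could be reloaded for inspection along these lines:

python3 - <<'EOF'
# Hypothetical reload of the pruned checkpoint; assumes a standard
# Hugging Face layout under save_dir (not confirmed by the Quark docs).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("save_dir")
print(sum(p.numel() for p in model.parameters()))  # parameter count after pruning
EOF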