Pruning#

Note

For information on accessing Quark PyTorch examples, refer to Accessing PyTorch Examples. This example and the relevant files are available at /torch/language_modeling/llm_pruning.

This topic contains examples of pruning language models (such as OPT and Llama) using Quark.

Preparation#

For Llama2 models, download the Hugging Face (HF) Llama2 checkpoint. Access to the checkpoint requires submitting a permission request to Meta. For additional details, see the Llama2 page on Hugging Face. After permission is granted, download the checkpoint to [llama2_checkpoint_folder].

Pruning Scripts#

Run the following Python scripts from the examples/torch/language_modeling/llm_pruning directory. The recipes use Llama2-7b as an example.

Note

  • To avoid memory limitations, users with multiple GPUs can add the --multi_gpu argument.

  • CPU users should add the --device cpu argument.
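The device flags in the note above can be appended to any recipe command. The following is a minimal sketch of that choice, not part of Quark: the helper names `device_args` and `build_command` are hypothetical, and the single-GPU branch assumes that running without either flag is acceptable, which this page does not state.

```python
import shlex
from typing import List


def device_args(num_gpus: int) -> List[str]:
    """Return the device-related flags from the note for a given GPU count."""
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    if num_gpus == 0:
        # CPU users add --device cpu.
        return ["--device", "cpu"]
    if num_gpus > 1:
        # Users with multiple GPUs can add --multi_gpu to avoid memory limitations.
        return ["--multi_gpu"]
    # Single GPU: no extra flag (assumption: the script's default is suitable).
    return []


def build_command(model_dir: str, recipe_args: List[str], num_gpus: int) -> List[str]:
    """Assemble a main.py invocation as an argument list (hypothetical helper)."""
    return [
        "python3", "main.py",
        "--model_dir", model_dir,
        *recipe_args,
        *device_args(num_gpus),
    ]


if __name__ == "__main__":
    # "path/to/llama2_checkpoint" is a placeholder path, not a real location.
    cmd = build_command("path/to/llama2_checkpoint", ["--skip_pruning"], num_gpus=0)
    print(shlex.join(cmd))
```

Building the command as a list and passing it to `subprocess.run(cmd, check=True)` avoids shell-quoting problems when the checkpoint path contains spaces.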

Recipe 1: Evaluation of Llama2 Float16 Model without Pruning#

python3 main.py --model_dir [llama2_checkpoint_folder] \
                --skip_pruning

Recipe 2: Pruning the Model and Saving to Safetensors#

python3 main.py --model_dir [llama2_checkpoint_folder] \
                --pruning_algo "osscar" \
                --num_calib_data 128 \
                --save_pruned_model \
                --save_dir save_dir
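After the run, one quick sanity check is to count the parameters stored in the saved files. The sketch below reads only the JSON header of each .safetensors file, using the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header), so it needs nothing beyond the Python standard library. It assumes the pruned model is written as one or more `*.safetensors` files directly inside `save_dir`; that file layout and the function names are assumptions, not something this page confirms. Whether the count drops relative to the original checkpoint depends on whether the pruning algorithm removes weights or only zeroes them.

```python
import json
import math
import struct
from pathlib import Path
from typing import Dict


def read_safetensors_header(path: Path) -> Dict[str, dict]:
    """Read only the JSON header of a .safetensors file; no tensor data is loaded."""
    with open(path, "rb") as f:
        # The file starts with an unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def count_parameters(save_dir: str) -> int:
    """Sum the element counts of all tensors in *.safetensors files under save_dir."""
    total = 0
    for file in sorted(Path(save_dir).glob("*.safetensors")):
        for info in read_safetensors_header(file).values():
            # math.prod of an empty shape is 1, which is correct for a scalar tensor.
            total += math.prod(info["shape"])
    return total
```

Calling `count_parameters("save_dir")` on the output directory, and on the original checkpoint folder if it is also stored as safetensors, gives two totals to compare.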