GGUF Exporting
GGUF exporting currently supports only asymmetric int4 per_group weight-only quantization, and the group_size must be 32. The supported models are Llama2-7b, Llama2-13b, Llama2-70b, and Llama3-8b.
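For reference, the following is a minimal, illustrative sketch of what asymmetric int4 per_group weight-only quantization with group_size = 32 means for a single weight tensor. It is not Quark's internal implementation and the function name is made up for illustration; it only shows the per-group scale/zero-point arithmetic behind the supported scheme.

import torch

def quantize_int4_asym_per_group(weight: torch.Tensor, group_size: int = 32):
    # Illustrative only, not Quark API. Split each row into groups of `group_size` values.
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "in_features must be a multiple of group_size"
    w = weight.reshape(out_features, in_features // group_size, group_size)
    # Asymmetric quantization: each group gets its own scale and zero point.
    w_min = w.amin(dim=-1, keepdim=True)
    w_max = w.amax(dim=-1, keepdim=True)
    scale = ((w_max - w_min) / 15.0).clamp_min(1e-8)   # unsigned int4 range is 0..15
    zero_point = (-w_min / scale).round().clamp(0, 15)
    q = (w / scale + zero_point).round().clamp(0, 15)
    return q.reshape(out_features, in_features), scale, zero_point

q, scale, zp = quantize_int4_asym_per_group(torch.randn(128, 256))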
Example of GGUF Exporting (for an already quantized model)
from quark.torch import export_gguf

# `quantized_model` is the model produced by Quark's quantization flow;
# `tokenizer_path` points to the original Hugging Face model directory.
model_dir = "meta-llama/Llama-2-7b-chat-hf"
export_gguf(quantized_model, output_dir="./output_dir", model_type="llama", tokenizer_path=model_dir)
After running the code above successfully, a .gguf file will be written under output_dir, for example ./output_dir/llama.gguf.
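As an optional sanity check (not part of Quark), the exported file can be loaded with a GGUF-compatible runtime such as llama-cpp-python, assuming it is installed:

from llama_cpp import Llama

# Load the exported GGUF file and run a short generation to confirm it works.
llm = Llama(model_path="./output_dir/llama.gguf")
output = llm("Hello, my name is", max_tokens=16)
print(output["choices"][0]["text"])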