# LM-Eval Harness Evaluations
This page describes how to run evaluations on LM-Eval-Harness tasks.
Summary of support:
| Model Types | Quark Quantized | Pretrained | Perplexity | ROUGE | METEOR | LM-Eval-Harness Tasks |
|---|---|---|---|---|---|---|
| **LLMs** | | | | | | |
| Torch | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| ONNX | ✓ | ✓ | X | ✓ | ✓ | ✓ |
| **VLMs** | | | | | | |
| Torch | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| ONNX | X | X | X | X | X | X |
## Recipes

- The `--model hf` arg is used to run LM-Eval-Harness on all Hugging Face LLMs.
- The `--model hf-multimodal` arg is used to run LM-Eval-Harness on supported VLMs. We currently support `["Llama-3.2-11B-Vision", "Llama-3.2-90B-Vision", "Llama-3.2-11B-Vision-Instruct", "Llama-3.2-90B-Vision-Instruct"]` (see the sketch after this list).
- The `--tasks` arg is used to specify the dataset of choice. See the LM-Eval-Harness documentation for the list of supported tasks.
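For example, a VLM evaluation combining these flags might look like the sketch below. This is a hypothetical invocation: the task name `mmmu_val` is illustrative only; substitute any multimodal task supported by LM-Eval-Harness.

```bash
# Sketch (hypothetical invocation): evaluate a supported VLM with the
# multimodal backend. "mmmu_val" is an illustrative multimodal task name.
python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-3.2-11B-Vision-Instruct \
    --model hf-multimodal \
    --tasks mmmu_val \
    --batch_size 1 \
    --device cuda
```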
## LM-Eval-Harness on Torch Models

Run LM-Eval-Harness using a pretrained LLM. Example with `Llama-2-7b-hf`:
```bash
python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --model hf \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cuda
```
Alternatively, to load a local checkpoint:
```bash
python llm_eval.py \
    --model_args pretrained=[local checkpoint path] \
    --model hf \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cuda
```
Run LM-Eval-Harness on a Quark Quantized model. Example with `Llama-2-7b-chat-hf-awq-uint4-asym-g128-bf16-lmhead`:
```bash
python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --model_reload \
    --import_file_format hf_format \
    --import_model_dir [path to Llama-2-7b-chat-hf-awq-uint4-asym-g128-bf16-lmhead model] \
    --model hf \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cuda
```
## LM-Eval-Harness on ONNX Models

Run LM-Eval-Harness on a pretrained, ONNX-exported LLM. Example with `Llama-2-7b-hf`:
```bash
python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --import_file_format onnx_format \
    --import_model_dir [path to Llama-2-7b-hf ONNX model] \
    --model hf \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cpu
```
Run LM-Eval-Harness on a Quark Quantized, ONNX-exported LLM. Example with `Llama-2-7b-chat-hf-awq-int4-asym-gs128-onnx`:
```bash
python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --import_file_format onnx_format \
    --import_model_dir [path to Llama-2-7b-chat-hf-awq-int4-asym-gs128-onnx model] \
    --model hf \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cpu
```
## Other Arguments

- Set `--multi_gpu` for multi-GPU support.
- Set the model dtype with `--model_args dtype=float32`.
- See the LM-Eval-Harness documentation for the full list of supported arguments. A few noteworthy ones are `--limit`, to limit the number of samples evaluated, and `--num_fewshot`, to specify the number of examples in the few-shot setup.
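As a sketch of how these arguments compose, the hypothetical run below evaluates a 5-shot setup on the first 100 samples with the model loaded in `float32`. The task name, shot count, and sample limit are placeholders to adapt to your setup.

```bash
# Sketch: 5-shot evaluation limited to 100 samples; dtype is passed
# inside --model_args alongside the checkpoint name.
python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=float32 \
    --model hf \
    --tasks mmlu_management \
    --num_fewshot 5 \
    --limit 100 \
    --batch_size 1 \
    --device cuda
```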