LM-Eval Harness Evaluations#

This page details how to run evaluations on LM-Eval-Harness tasks.

Summary of support:

| Model Types        | Quark Quantized | Pretrained | Perplexity | ROUGE | METEOR | LM Eval Harness Tasks |
|--------------------|-----------------|------------|------------|-------|--------|-----------------------|
| LLMs (Torch, ONNX) | X               | X          | X          | X     | X      | X                     |
| VLMs (Torch, ONNX) |                 |            |            |       |        | X                     |

Recipes#

  • The --model hf arg runs lm-harness on all Hugging Face LLMs.

  • The --model hf-multimodal arg runs lm-harness on supported VLMs. We currently support ["Llama-3.2-11B-Vision", "Llama-3.2-90B-Vision", "Llama-3.2-11B-Vision-Instruct", "Llama-3.2-90B-Vision-Instruct"]. See the sketch after this list.

  • The --tasks arg selects the evaluation task(s). See here for the tasks supported by lm-eval-harness.
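
For illustration, a minimal VLM invocation might mirror the LLM recipes below. This is a sketch only: the flag combination and choice of task are assumptions based on those recipes, not confirmed output of this script:

python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-3.2-11B-Vision-Instruct \
    --model hf-multimodal \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cuda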

LM-Eval-Harness on Torch Models#

  1. LM-Eval-Harness on a pretrained LLM. Example with Llama-2-7b-hf:

python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --model hf \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cuda

Alternatively, to load a local checkpoint:

python llm_eval.py \
    --model_args pretrained=[local checkpoint path] \
    --model hf \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cuda

  2. LM-Eval-Harness on a Quark Quantized model. Example with Llama-2-7b-chat-hf-awq-uint4-asym-g128-bf16-lmhead:

python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --model_reload \
    --import_file_format hf_format \
    --import_model_dir [path to Llama-2-7b-chat-hf-awq-uint4-asym-g128-bf16-lmhead model] \
    --model hf \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cuda
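
The recipes above go through Quark's llm_eval.py wrapper. As a cross-check for the pretrained case, lm-eval-harness also ships its own console script; this is a minimal sketch assuming a recent lm-eval release that installs the lm_eval command:

lm_eval --model hf \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cuda

Note that this evaluates the unquantized Hugging Face model only; Quark Quantized checkpoints still require the llm_eval.py recipe above.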

LM-Eval-Harness on ONNX Models#

  1. LM-Eval-Harness on a pretrained, ONNX-exported LLM. Example with Llama-2-7b-hf:

python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --import_file_format onnx_format \
    --import_model_dir [path to Llama-2-7b-hf ONNX model] \
    --model hf \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cpu

  2. LM-Eval-Harness on a Quark Quantized, ONNX-exported LLM. Example with Llama-2-7b-chat-hf-awq-int4-asym-gs128-onnx:

python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --import_file_format onnx_format \
    --import_model_dir [path to Llama-2-7b-chat-hf-awq-int4-asym-gs128-onnx model] \
    --model hf \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cpu
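
These recipes assume the ONNX model directory passed to --import_model_dir already exists; this page does not specify how it is produced. One possible route for the pretrained case, shown purely as an assumption, is Hugging Face Optimum's exporter (Quark's own export flow may be what produces the quantized ONNX model above):

optimum-cli export onnx --model meta-llama/Llama-2-7b-hf [output directory]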

Other Arguments#

  1. Set --multi_gpu for multi-GPU support.

  2. Set the model dtype via --model_args (e.g. dtype=float32).

  3. See the list of arguments supported by LM-Eval-Harness here. A few noteworthy ones are --limit, which caps the number of samples evaluated, and --num_fewshot, which sets the number of examples in the few-shot setup. A combined sketch follows this list.
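
The following sketch combines these arguments with the pretrained Torch recipe above. Each flag appears individually on this page, but their combination and the example values (100 samples, 5-shot) are assumptions:

python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=float32 \
    --model hf \
    --tasks mmlu_management \
    --batch_size 1 \
    --device cuda \
    --multi_gpu \
    --limit 100 \
    --num_fewshot 5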