Language Model Evaluations in Quark#

This document provides comprehensive guidelines for evaluating LLMs and VLMs. The types of evaluations supported are summarized in the table below.

| Model Types | Quark Quantized | Pretrained | Perplexity | ROUGE | METEOR | LM Eval Harness Tasks |
|---|---|---|---|---|---|---|
| LLMs (Torch, ONNX); VLMs (Torch, ONNX) | X | X | X | X | X | X |

Getting Started#

  1. Ensure that the Quark package is installed in your conda environment.

  2. Install the additional requirements for evaluations:

pip install -r requirements.txt

User Guide#

Below we share a list of recipes that enable the features above. Please click on each individual link below for more details.

Important Details#

1. PPL in LM-Evaluation-Harness vs PPL Feature in Table#

The PPL evaluations generated by ``--ppl`` above report token-level PPL. Note that LM-Evaluation-Harness can also calculate PPL; however, it reports word-level PPL, so you will see a difference between the two results. We encourage you to use ``--ppl`` to evaluate PPL, as most gold-standard perplexity scores are reported assuming token-level PPL.
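
To make the distinction concrete, here is a small illustrative sketch (the numbers are made up and this is not Quark code): the same per-token negative log-likelihoods yield a lower perplexity when normalized by the number of tokens than when normalized by the number of words, because a tokenizer usually produces more tokens than words.

    import math

    # Illustrative numbers only: per-token negative log-likelihoods (nats)
    # for a short generated sequence that detokenizes to 4 words.
    token_nlls = [2.1, 0.7, 1.3, 0.4, 2.8, 0.9]
    num_words = 4

    total_nll = sum(token_nlls)
    token_level_ppl = math.exp(total_nll / len(token_nlls))  # what --ppl reports
    word_level_ppl = math.exp(total_nll / num_words)         # what LM-Evaluation-Harness reports

    print(f"token-level PPL: {token_level_ppl:.2f}")
    print(f"word-level PPL:  {word_level_ppl:.2f}")  # higher, since there are fewer words than tokens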

2. ONNX Exported Models#

To run ROUGE and METEOR evaluations, ONNX models can be either OGA exported or Optimum exported. To run LM-Evaluation-Harness tasks or PPL on ONNX models, only OGA-exported models are supported: LM-Evaluation-Harness tasks rely on the genai_config.json file generated by the OGA Model Builder, and for PPL, ONNX models are loaded via OGA.
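
As a minimal sketch of why the OGA export matters (assuming the onnxruntime-genai Python package is installed; the model path is a placeholder), the snippet below loads an OGA-exported model directory, which must contain the genai_config.json produced by the OGA Model Builder.

    import onnxruntime_genai as og

    # Placeholder path: point this at a directory produced by the OGA Model
    # Builder; it must contain genai_config.json next to the ONNX weights.
    model_dir = "path/to/oga_exported_model"

    model = og.Model(model_dir)        # reads genai_config.json from model_dir
    tokenizer = og.Tokenizer(model)
    print("Loaded OGA-exported model from", model_dir)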

3. Support for VLMs#

The VLMs currently supported are ["Llama-3.2-11B-Vision", "Llama-3.2-90B-Vision", "Llama-3.2-11B-Vision-Instruct", "Llama-3.2-90B-Vision-Instruct"].

4. LLM_Eval_Harness vs LLM_Eval_Harness_Offline#

Use llm_eval.py --mode standard to run end-to-end evaluation, both model inference and metrics, on a single device (CPU or GPU). Use llm_eval.py --mode offline to run model inference on hardware different from the evaluation (e.g., predictions on NPU, evaluation metrics on GPU). Note that llm_eval.py --mode offline focuses only on generation tasks; see example_quark_torch_llm_eval_harness_offline.rst for the supported generation tasks.

5. Export Evaluation Results#

All evaluation results can be exported to a JSON file by using the --metrics_output_dir option, which specifies the output directory.
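
As a hedged example of consuming the exported results (the exact file names written by llm_eval.py are not specified in this document, so the snippet simply scans the output directory for JSON files):

    import json
    from pathlib import Path

    # Directory previously passed to --metrics_output_dir; the file names
    # inside are whatever llm_eval.py wrote, so we just scan for *.json.
    metrics_dir = Path("./eval_results")

    for result_file in sorted(metrics_dir.glob("*.json")):
        with result_file.open() as f:
            metrics = json.load(f)
        print(result_file.name, "->", metrics)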