Language Model Evaluations in Quark#

This document provides comprehensive guidelines for evaluating LLMs and VLMs. The types of evaluations supported are summarized in the table below.

| Model Types | Quark Quantized | Pretrained | Perplexity | ROUGE | METEOR | LM Eval Harness Tasks |
|---|---|---|---|---|---|---|
| LLMs (Torch, ONNX) | X | X | X | X | X | X |
| VLMs (Torch, ONNX) | X | X | X | X | X | X |
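For orientation, ROUGE and METEOR are reference-based text-generation metrics. Below is a minimal sketch of computing them with the Hugging Face evaluate library; it is illustrative only, and Quark's evaluation recipes may use a different implementation.

```python
# Minimal sketch: scoring generated text with ROUGE and METEOR via the
# Hugging Face `evaluate` library. Illustrative only, not Quark's internal code.
import evaluate

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

# Placeholder prediction/reference pair for illustration.
predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```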

Getting Started#

  1. Ensure that the Quark package is installed in your conda environment.

  2. Install the additional requirements for evaluations:

pip install -r requirements.txt

User Guide#

Below we share a list of recipes that enable the features above. Please click on the individual links below for more details.

Important Details#

1. PPL in LM-Eval-Harness vs PPL Feature in Table#

The PPL evaluations generated by the Perplexity feature in the table above report token-level PPL. Note that LM-Eval-Harness can also calculate PPL; however, it reports word-level PPL, so the two results will differ. We encourage you to use the Perplexity feature to evaluate PPL, as most gold-standard perplexity scores are reported assuming token-level PPL.
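To make the distinction concrete, below is a minimal sketch of token-level PPL using Hugging Face Transformers. The model name and input text are placeholders, and this is not Quark's evaluation code.

```python
# Minimal sketch: token-level perplexity for a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "Perplexity is the exponentiated average negative log-likelihood per token."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels=input_ids returns the mean cross-entropy over predicted tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

# Token-level PPL: exp of the per-token average negative log-likelihood.
# Word-level PPL (as in LM-Eval-Harness) instead normalizes the summed NLL
# by the number of words, which generally yields a different value.
print(f"token-level PPL = {torch.exp(loss).item():.2f}")
```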

2. ONNX Exported Models#

To run ROUGE and METEOR evaluations, ONNX models can be either OGA-exported or Optimum-exported. To run LM-Eval-Harness on ONNX models, we currently support only OGA-exported models, since we rely on the genai_config.json file generated by the OGA Model Builder.
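As a sketch of why genai_config.json matters: onnxruntime-genai loads a model directory through that file, so a directory without it cannot be opened this way. The export path below is a placeholder.

```python
# Minimal sketch: loading an OGA-exported ONNX model with onnxruntime-genai.
# og.Model reads genai_config.json from the export directory, which is why
# LM-Eval-Harness support is limited to OGA-exported models here.
import onnxruntime_genai as og

model = og.Model("./llama3-8b-int4-oga")  # hypothetical OGA export directory
tokenizer = og.Tokenizer(model)

# Quick round-trip sanity check of the tokenizer.
print(tokenizer.decode(tokenizer.encode("sanity check")))
```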

3. Support for VLMs#

The VLMs currently supported are: Llama-3.2-11B-Vision, Llama-3.2-90B-Vision, Llama-3.2-11B-Vision-Instruct, and Llama-3.2-90B-Vision-Instruct.
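If you script your own runs, a minimal pre-flight check against this list might look like the sketch below; the helper name and the hub-id handling are assumptions for illustration, not part of Quark's API.

```python
# Minimal sketch of a pre-flight check against the supported VLM list.
# The helper name and hub-id handling are assumptions, not Quark's API.
SUPPORTED_VLMS = [
    "Llama-3.2-11B-Vision",
    "Llama-3.2-90B-Vision",
    "Llama-3.2-11B-Vision-Instruct",
    "Llama-3.2-90B-Vision-Instruct",
]

def assert_vlm_supported(model_id: str) -> None:
    # Compare on the basename so full hub ids such as
    # "meta-llama/Llama-3.2-11B-Vision" also match.
    if model_id.split("/")[-1] not in SUPPORTED_VLMS:
        raise ValueError(f"Unsupported VLM for evaluation: {model_id}")

assert_vlm_supported("meta-llama/Llama-3.2-11B-Vision")
```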