Language Model Evaluations in Quark#
This document provides comprehensive guidelines for evaluating LLMs and VLMs. The types of evaluations supported are summarized in the table below.
| Model Types | Quark Quantized | Pretrained | Perplexity | ROUGE | METEOR | LM-Eval-Harness |
| --- | --- | --- | --- | --- | --- | --- |
| **LLMs** | | | | | | |
| Torch models | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| ONNX models | ✓ | ✓ | X | ✓ | ✓ | ✓ |
| **VLMs** | | | | | | |
| Torch models | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| ONNX models | X | X | X | X | X | X |
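For readers unfamiliar with ROUGE and METEOR, the sketch below computes both metrics on a toy prediction/reference pair using the Hugging Face `evaluate` library. This is a generic illustration of what the metrics measure, not Quark's evaluation recipe.

```python
# Generic metric illustration using Hugging Face `evaluate`
# (pip install evaluate rouge_score nltk); not the Quark pipeline.
import evaluate

predictions = ["the quick brown fox jumps over the dog"]
references = ["the quick brown fox jumps over the lazy dog"]

rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

# Both metrics score n-gram overlap between generated and reference text;
# scores lie in [0, 1], and higher is better.
print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```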
Getting Started#
Ensure that the Quark package is installed in your conda environment.
Install the additional requirements for evaluations:
pip install -r requirements.txt
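As a quick sanity check, you can verify the installation from Python; this assumes the package imports under the name `quark`:

```python
# Sanity check that Quark is importable in the active conda environment;
# the import name `quark` is an assumption about the installed package.
import quark

print("Quark imported from:", quark.__file__)
```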
User Guide#
Below is a list of recipes that enable the features above. Please click on each link for more details.
Important Details#
1. PPL in LM-Eval-Harness vs PPL Feature in Table#
The PPL evaluations generated by (1) above report token-level PPL. Note that LM-Eval-Harness can also calculate PPL; however, LM-Eval-Harness reports word-level PPL, so you will see a difference between the two results. We encourage you to use (1) to evaluate PPL, as most gold-standard perplexity scores are reported as token-level PPL.
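To make the difference concrete, here is a minimal sketch with made-up numbers: the same total negative log-likelihood yields different perplexities depending on whether it is normalized by the token count or the word count.

```python
import math

# Toy numbers: a 4-word sentence tokenized into 6 tokens, with a total
# negative log-likelihood (NLL) of 12.0 nats under the model.
total_nll = 12.0
num_tokens = 6
num_words = 4

# Token-level PPL (the PPL feature in the table): normalize NLL by token count.
token_ppl = math.exp(total_nll / num_tokens)  # exp(2.0) ~= 7.39

# Word-level PPL (what LM-Eval-Harness reports): normalize NLL by word count.
word_ppl = math.exp(total_nll / num_words)    # exp(3.0) ~= 20.09

print(f"token-level PPL: {token_ppl:.2f}")  # 7.39
print(f"word-level PPL:  {word_ppl:.2f}")   # 20.09
```

Because a text usually contains more tokens than words, word-level PPL is typically the larger of the two numbers for the same model and text.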
2. ONNX Exported Models#
To run ROUGE and METEOR evaluations, ONNX models can be either OGA-exported or
Optimum-exported. To run LM-Eval-Harness on ONNX models, we currently support
only OGA-exported models, since we rely on the genai_config.json
file generated by the OGA Model Builder.
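As an illustration of that dependency, the sketch below loads an OGA-exported model folder with the onnxruntime-genai Python package; it assumes a recent onnxruntime-genai build, and the model folder path is hypothetical.

```python
# Minimal sketch, assuming a recent onnxruntime-genai build.
import onnxruntime_genai as og

# og.Model() reads genai_config.json from the model folder -- the file that
# the OGA Model Builder generates and that LM-Eval-Harness support relies on.
model = og.Model("./my-oga-exported-model")  # hypothetical OGA export dir
tokenizer = og.Tokenizer(model)

tokens = tokenizer.encode("The capital of France is")
print(tokens)  # token ids understood by the exported model
```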
3. Support for VLMs#
The VLMs currently supported are:
["Llama-3.2-11B-Vision", "Llama-3.2-90B-Vision", "Llama-3.2-11B-Vision-Instruct", "Llama-3.2-90B-Vision-Instruct"]