Language Model Evaluations in Quark#
This document provides comprehensive guidelines for evaluating LLMs and VLMs. The types of evaluations supported are summarized in the table below.
| Model Types | Quark Quantized | Pretrained | Perplexity | ROUGE | METEOR | LM-Eval-Harness Tasks |
|---|---|---|---|---|---|---|
| **LLMs** | | | | | | |
| Torch Models | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| ONNX Models | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| **VLMs** | | | | | | |
| Torch Models | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| ONNX Models | X | X | X | X | X | X |
Getting Started#
Ensure that the Quark package is installed in your conda environment.
Install the additional requirements for evaluations:
pip install -r requirements.txt
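A minimal setup sketch is shown below. The conda environment name and the import check are illustrative assumptions, not part of the official installation instructions.

```bash
# Illustrative setup flow; the environment name "quark_env" is an assumption.
conda activate quark_env            # conda environment where Quark is installed
python -c "import quark"            # sanity check: the Quark package should import without errors
pip install -r requirements.txt     # additional dependencies needed for the evaluations
```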
User Guide#
Below we share a list of recipes that enable the features above. Please follow the individual links below for more details.
Important Details#
1. PPL in LM-Evaluation-Harness vs PPL Feature in Table#
The PPL evaluations generated by ``--ppl`` above report token-level PPL.
Note that LM-Evaluation-Harness can also calculate PPL; however, LM-Evaluation-Harness reports word-level PPL, so you will see a difference between the two results. We encourage you to use ``--ppl`` to evaluate PPL, as most gold-standard perplexity scores are reported assuming token-level PPL.
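For reference, the difference comes down to normalization. The formulas below are the standard definitions, not Quark-specific ones: both use the same total negative log-likelihood over the token sequence, but token-level PPL normalizes by the number of tokens while word-level PPL normalizes by the number of words.

```latex
\mathrm{PPL}_{\mathrm{token}} = \exp\left(-\frac{1}{N_{\mathrm{tokens}}} \sum_{i=1}^{N_{\mathrm{tokens}}} \log p(x_i \mid x_{<i})\right)
\qquad
\mathrm{PPL}_{\mathrm{word}} = \exp\left(-\frac{1}{N_{\mathrm{words}}} \sum_{i=1}^{N_{\mathrm{tokens}}} \log p(x_i \mid x_{<i})\right)
```

Because a tokenizer typically splits text into more tokens than words, the word-level value is normalized over fewer units and therefore tends to be larger.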
2. ONNX Exported Models#
To run ROUGE and METEOR evaluations, ONNX models can be either OGA-exported or Optimum-exported.
To run LM-Evaluation-Harness tasks or PPL on ONNX models, only OGA-exported models are supported: LM-Evaluation-Harness tasks rely on the genai_config.json file generated by the OGA Model Builder, and for PPL, ONNX models are imported via OGA.
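As a rough illustration, an OGA (ONNX Runtime GenAI) export that produces the genai_config.json mentioned above typically looks like the following; the model name, output path, precision, and execution provider below are example values, not requirements of this document.

```bash
# Example OGA export via the ONNX Runtime GenAI Model Builder (argument values are illustrative).
# -m: Hugging Face model id (example), -o: output folder (genai_config.json is written here),
# -p: precision (example), -e: execution provider (example).
python -m onnxruntime_genai.models.builder \
    -m meta-llama/Llama-3.2-1B-Instruct \
    -o ./llama32_1b_onnx \
    -p int4 \
    -e cpu
```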
3. Support for VLMs#
The VLMs currently supported are ["Llama-3.2-11B-Vision", "Llama-3.2-90B-Vision", "Llama-3.2-11B-Vision-Instruct", "Llama-3.2-90B-Vision-Instruct"].
4. LLM_Eval_Harness vs LLM_Eval_Harness_Offline#
Use llm_eval.py --mode standard to run end-to-end evaluation, with both inference and metric computation on a single device (CPU or GPU).
Use llm_eval.py --mode offline to run model inference on different hardware from the evaluation metrics (e.g., predictions on NPU, evaluation metrics on GPU).
Note that llm_eval.py --mode offline supports only generation tasks. See example_quark_torch_llm_eval_harness_offline.rst for the supported generation tasks.
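A minimal sketch of the two invocations is shown below. Only the --mode flag is taken from this document; the remaining arguments (model, tasks, device, and so on) are placeholders, not real flag names.

```bash
# Standard mode: inference and metric computation run on the same device (CPU or GPU).
python llm_eval.py --mode standard <model/task/device arguments>

# Offline mode: inference runs on one device (e.g., NPU); metrics are computed separately (e.g., on GPU).
# Only generation tasks are supported in this mode.
python llm_eval.py --mode offline <model/task/device arguments>
```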
5. Export Evaluation Results#
All evaluation results can be exported to a JSON file by using the --metrics_output_dir option, which specifies the output directory.
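For example, combining flags that appear in this document (other arguments omitted as placeholders, and the directory name chosen arbitrarily), results can be written out as JSON like this:

```bash
# --metrics_output_dir specifies where the JSON results are written (directory name is an example).
python llm_eval.py --mode standard --ppl --metrics_output_dir ./eval_results <other arguments>
```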