Rouge & Meteor Evaluations

This page details how to run ROUGE and METEOR evaluations. ROUGE and METEOR scores are currently available for the following datasets: `samsum`, `xsum`, and `cnn_dm` (an abbreviation for `cnn_dailymail`).
Summary of support:
| Model Types | Quark Quantized | Pretrained | Perplexity | ROUGE | METEOR |
|---|---|---|---|---|---|
| LLMs (Torch) | ✓ | ✓ | ✓ | ✓ | ✓ |
| LLMs (ONNX) | ✓ | ✓ | X | ✓ | ✓ |
| VLMs (Torch) | ✓ | ✓ | ✓ | ✓ | ✓ |
| VLMs (ONNX) | X | X | X | X | X |
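For intuition about what these metrics measure: ROUGE scores n-gram overlap between a generated summary and a reference, while METEOR combines unigram precision and recall with a recall-weighted mean. The sketch below is a deliberately simplified illustration, not the implementation used by `llm_eval.py` (the real metrics add stemming, synonym matching, ROUGE-L, a fragmentation penalty, and more):

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over clipped unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped matches per word
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def meteor_like(candidate: str, reference: str, alpha: float = 0.9) -> float:
    """Rough METEOR-style score: recall-weighted harmonic mean of unigram
    precision and recall (no stemming, synonyms, or fragmentation penalty)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())
    r = overlap / sum(ref.values())
    return p * r / (alpha * p + (1 - alpha) * r)


# Identical texts score 1.0; disjoint texts score 0.0.
print(rouge1_f1("the cat sat on the mat", "the cat sat on the mat"))  # -> 1.0
print(rouge1_f1("alpha beta", "gamma delta"))  # -> 0.0
```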
Recipes

- The `--rouge` and `--meteor` flags specify the ROUGE and METEOR tasks, respectively. You can run either or both.
- The `--num_eval_data` argument specifies the number of samples used from an eval dataset.
- The `--dataset` argument specifies the dataset. Select from `xsum`, `cnn_dm`, and `samsum`. Multiple datasets can be specified comma-separated: `--dataset samsum,xsum`.
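The comma-separated `--dataset` convention can be handled with a small `argparse` helper. The sketch below mirrors the flags described above for illustration only; the actual parsing logic inside `llm_eval.py` may differ:

```python
import argparse

ALLOWED_DATASETS = {"xsum", "cnn_dm", "samsum"}


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented flags.
    parser = argparse.ArgumentParser(description="ROUGE/METEOR eval flags (sketch)")
    parser.add_argument("--rouge", action="store_true", help="run the ROUGE task")
    parser.add_argument("--meteor", action="store_true", help="run the METEOR task")
    parser.add_argument("--num_eval_data", type=int, default=None,
                        help="number of samples drawn from the eval dataset")
    parser.add_argument(
        "--dataset",
        type=lambda s: s.split(","),  # "samsum,xsum" -> ["samsum", "xsum"]
        default=[],
        help="comma-separated subset of [xsum, cnn_dm, samsum]",
    )
    return parser


args = build_parser().parse_args(["--rouge", "--meteor", "--dataset", "samsum,xsum"])
assert set(args.dataset) <= ALLOWED_DATASETS  # validate the selection
print(args.dataset)  # -> ['samsum', 'xsum']
```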
Rouge/Meteor on Torch Models

Rouge and Meteor on 20 samples of XSUM, using a pretrained LLM. Example with `Llama2-7b-hf`:

```shell
python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --rouge \
    --meteor \
    --dataset xsum \
    --trust_remote_code \
    --batch_size 1 \
    --num_eval_data 20 \
    --device cuda
```
Alternatively, to load a local checkpoint:

```shell
python llm_eval.py \
    --model_args pretrained=[local checkpoint path] \
    --rouge \
    --meteor \
    --dataset xsum \
    --trust_remote_code \
    --batch_size 1 \
    --num_eval_data 20 \
    --device cuda
```
Rouge and Meteor on a Quark Quantized model. Example with `Llama-2-7b-chat-hf-awq-uint4-asym-g128-bf16-lmhead`:

```shell
python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --model_reload \
    --import_file_format hf_format \
    --import_model_dir [path to Llama-2-7b-chat-hf-awq-uint4-asym-g128-bf16-lmhead model] \
    --rouge \
    --meteor \
    --dataset xsum \
    --trust_remote_code \
    --batch_size 1 \
    --num_eval_data 20 \
    --device cuda
```
Rouge/Meteor on ONNX Models

Rouge and Meteor on a pretrained, ONNX-exported LLM. Example with `Llama2-7b-hf`:

```shell
python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --import_file_format onnx_format \
    --import_model_dir [path to Llama-2-7b-hf ONNX model] \
    --rouge \
    --meteor \
    --dataset xsum \
    --trust_remote_code \
    --batch_size 1 \
    --num_eval_data 20 \
    --device cpu
```
Rouge and Meteor on a Quark Quantized, ONNX-exported LLM. Example with `Llama-2-7b-chat-hf-awq-int4-asym-gs128-onnx`:

```shell
python llm_eval.py \
    --model_args pretrained=meta-llama/Llama-2-7b-hf \
    --import_file_format onnx_format \
    --import_model_dir [path to Llama-2-7b-chat-hf-awq-int4-asym-gs128-onnx model] \
    --rouge \
    --meteor \
    --dataset xsum \
    --trust_remote_code \
    --batch_size 1 \
    --num_eval_data 20 \
    --device cpu
```
Other Arguments

- Set `--multi_gpu` for multi-GPU support.
- Set `--save_metrics_to_csv` and `--metrics_output_dir` to save scores to CSV.
- Set the model dtype with `--model_args dtype=float32`.
- Set `--seq_len` for the maximum sequence length of inputs.
- Set `--max_new_toks` for the maximum number of new tokens generated (excluding the length of the input tokens).
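To give a concrete picture of the CSV output path, here is a minimal sketch of writing a flat metrics dictionary to a CSV file. The column names, file name, and one-row layout are assumptions for illustration; the actual schema produced by `--save_metrics_to_csv` may differ:

```python
import csv
from pathlib import Path


def save_metrics_to_csv(metrics: dict, output_dir: str,
                        filename: str = "metrics.csv") -> Path:
    """Write a flat {metric_name: score} dict as a one-row CSV file.

    Illustrative only: llm_eval.py's real CSV schema may differ.
    """
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the output dir if needed
    path = out_dir / filename
    with path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(metrics))
        writer.writeheader()
        writer.writerow(metrics)
    return path


# Example with hypothetical ROUGE/METEOR scores for one run.
scores = {"rouge1": 0.41, "rouge2": 0.19, "rougeL": 0.33, "meteor": 0.29}
csv_path = save_metrics_to_csv(scores, "eval_results")
```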