LM-Evaluation Harness (Offline)#
We provide a multi-step flow to run LM-Evaluation Harness metrics offline for ONNX models. Offline mode is used to evaluate model generations on specific hardware (e.g., NPUs). Offline mode is invoked through llm_eval.py --mode offline. Currently, only the generation tasks below are supported in offline mode.
Supported Tasks#
gsm8k, tinyGSM8k
Step-by-Step Process#
Below are the steps for using offline mode.
Please make sure --num_fewshot is set to 0 to allow for fair comparisons with OGA model generations.
1. Retrieve dataset from LM-Eval-Harness#
Use --retrieve_dataset to save the dataset's inputs.json and references.json.
Example shown with 20 samples of gsm8k:
python llm_eval.py \
--mode offline \
--retrieve_dataset \
--tasks gsm8k \
--limit 20 \
--num_fewshot 0
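After the command completes, a quick sanity check can confirm that the two files pair up. This is a minimal sketch, assuming inputs.json and references.json are written to the working directory as JSON arrays with one entry per sample; adjust the paths to match where your files were saved.

import json

# Minimal sanity check on the files written by --retrieve_dataset.
# Assumes both files are JSON arrays with one entry per sample.
with open("inputs.json") as f:
    inputs = json.load(f)
with open("references.json") as f:
    references = json.load(f)

print(f"{len(inputs)} inputs, {len(references)} references")
assert len(inputs) == len(references), "inputs and references should pair one-to-one"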
2. Export pre-trained Model-Of-Interest to ONNX#
Use the OGA Model Builder to save the pre-trained model in ONNX format. See here for how to use the OGA Model Builder.
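For reference, a typical Model Builder invocation looks like the following. The model name, precision, and execution provider are illustrative; consult the OGA Model Builder documentation for the options supported by your onnxruntime-genai version.
python -m onnxruntime_genai.models.builder \
-m microsoft/Phi-3.5-mini-instruct \
-o [path to output ONNX Model] \
-p int4 \
-e cpu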
3. Retrieve OGA references for pre-trained ONNX Model#
Use --oga_references to save the OGA references for a particular pre-trained model.
Example shown with 20 samples of gsm8k for the pre-trained Phi3.5-mini-instruct ONNX model:
python llm_eval.py \
--mode offline \
--oga_references \
--inputs [path to inputs.json] \
--import_model_dir [path to Phi3.5-mini-instruct ONNX Model] \
--import_file_format onnx_format \
--tasks gsm8k \
--limit 20 \
--num_fewshot 0 \
--eor "<EOR>"
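Before moving on, it can help to verify that the saved references file contains one generation per sample. A minimal sketch, assuming the file is named references.txt, lives in the working directory, and uses the same "<EOR>" delimiter passed via --eor:

# Split references.txt on the end-of-response delimiter and count outputs.
EOR = "<EOR>"

with open("references.txt") as f:
    text = f.read()

# Drop empty chunks left by a trailing delimiter or extra whitespace.
outputs = [chunk.strip() for chunk in text.split(EOR) if chunk.strip()]
print(f"Found {len(outputs)} model outputs")  # expect 20 for --limit 20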
4. Get Baseline Evaluation Scores on pre-trained ONNX Model#
Use --eval_mode to compare the pre-trained model's references to the dataset references.
Example comparing the Phi3.5-mini-instruct ONNX model references to the GSM8k references:
python llm_eval.py \
--mode offline \
--eval_mode \
--outputs_path [path to Phi3.5-mini-instruct OGA references.txt] \
--tasks gsm8k \
--limit 20 \
--num_fewshot 0 \
--eor "<EOR>"
5. Evaluate an optimized ONNX Model#
Now use --eval_mode to compare an optimized model to the dataset references.
Example comparing a quantized Phi3.5-mini-instruct ONNX model's predictions to the GSM8k references:
python llm_eval.py \
--mode offline \
--eval_mode \
--outputs_path [path to quantized model predictions.txt] \
--tasks gsm8k \
--limit 20 \
--num_fewshot 0 \
--eor "<EOR>"
Note: predictions.txt should follow the same format as references.txt from Step 4. That is, each model output must be separated by an end-of-response delimiter such as "<EOR>". See the example formatting below:
This would be the first model output.
<EOR>
This would be the second model output.
<EOR>
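If you are producing predictions.txt yourself (for example, from generations collected on an NPU), a minimal sketch of writing it in this format is shown below. The list of outputs is a hypothetical placeholder; substitute your optimized model's actual generations.

EOR = "<EOR>"

# Hypothetical placeholder outputs; replace with your optimized model's generations.
predictions = [
    "This would be the first model output.",
    "This would be the second model output.",
]

with open("predictions.txt", "w") as f:
    for output in predictions:
        # Each generation is followed by the end-of-response delimiter,
        # mirroring the references.txt layout from Step 4.
        f.write(output.rstrip("\n") + "\n" + EOR + "\n")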
6. Compare Scores from Step 4 and Step 5#
Compute the percent error between the scores from Step 4 and Step 5 to understand how the quantized model compares to the original pre-trained model.
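A minimal sketch of the comparison; the two scores below are placeholders for the task metrics reported by --eval_mode in Steps 4 and 5.

# Placeholder scores; substitute the metrics printed by --eval_mode.
baseline_score = 0.65   # pre-trained ONNX model (Step 4)
optimized_score = 0.60  # quantized ONNX model (Step 5)

# Percent error of the optimized model relative to the baseline.
percent_error = abs(optimized_score - baseline_score) / baseline_score * 100
print(f"Percent error: {percent_error:.2f}%")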