Evaluate LLM Performance on a Cloud GPU

Run standardized benchmarks on LLMs using lm-evaluation-harness. Compare models on MMLU, HellaSwag, ARC, and more on your own GPU, using the same tasks and few-shot settings for every run so results are reproducible.

1. Deploy a GPU instance

runcrate instances create --name eval --gpu A100 --template ubuntu-devbox
runcrate instances status eval

2. Install lm-eval-harness

runcrate ssh eval -- "pip install 'lm-eval[vllm]' vllm"
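
Before starting a long run, it can help to confirm that both packages are importable on the instance. The following is a minimal sketch, assuming the pip packages expose the import names `lm_eval` and `vllm`; the helper name `check_packages` is hypothetical and not part of either library.

```python
import importlib.util


def check_packages(names):
    """Return {import_name: bool} telling whether each top-level module can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # Assumed import names: lm-eval installs as `lm_eval`, vLLM as `vllm`.
    for name, found in check_packages(["lm_eval", "vllm"]).items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```

Save it on the instance and run it with the same Python interpreter that pip installed into; a MISSING line suggests the install went into a different environment.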

3. Run a benchmark

Evaluate Llama 3.1 8B Instruct on MMLU (5-shot):

runcrate ssh eval -- "lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path /workspace/results/llama-8b-mmlu"

4. Run a full benchmark suite

Pass a comma-separated list to --tasks to run several benchmarks in one invocation:

runcrate ssh eval -- "lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
  --tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2,winogrande,gsm8k \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path /workspace/results/llama-8b-full"

5. Compare two models

Run the same suite on a second model (here Qwen2.5 7B Instruct) with identical tasks and few-shot count, so the scores are directly comparable:

runcrate ssh eval -- "lm_eval --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct \
  --tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2,winogrande,gsm8k \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path /workspace/results/qwen-7b-full"
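
To keep the settings identical across models, the command can be generated rather than copied by hand. This is a sketch that uses only the flags shown in this guide; the helper name `build_lm_eval_cmd` and the `TASKS` constant are hypothetical.

```python
import shlex

# Task list copied from the full-suite command in this guide.
TASKS = ["mmlu", "hellaswag", "arc_challenge", "truthfulqa_mc2", "winogrande", "gsm8k"]


def build_lm_eval_cmd(model_id, output_dir, tasks=TASKS, num_fewshot=5):
    """Build an lm_eval command line with the same flags for every model."""
    args = [
        "lm_eval",
        "--model", "vllm",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "auto",
        "--output_path", output_dir,
    ]
    # shlex.join quotes each argument so the string is safe to paste into a shell.
    return shlex.join(args)


if __name__ == "__main__":
    runs = {
        "meta-llama/Llama-3.1-8B-Instruct": "/workspace/results/llama-8b-full",
        "Qwen/Qwen2.5-7B-Instruct": "/workspace/results/qwen-7b-full",
    }
    for model_id, out_dir in runs.items():
        print(build_lm_eval_cmd(model_id, out_dir))
```

Each printed line can be passed to runcrate ssh eval -- "..." as in the steps above; only the model ID and output path differ between runs.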

6. Download results

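Depending on the lm-eval version, the results file may be timestamped and nested in a per-model subdirectory, so search for it rather than hard-coding results.json:

runcrate ssh eval -- "find /workspace/results/llama-8b-full -name 'results*.json' -exec python -m json.tool {} \;"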
runcrate cp eval:/workspace/results/ ./eval-results/
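
Once both result sets are downloaded, the scores can be lined up side by side. The sketch below assumes each results file has a top-level "results" object that maps task names to metric dictionaries (with keys such as "acc,none"), and that the file name starts with "results"; check both against your own output. The helper names `load_results`, `compare`, and `format_table` are hypothetical.

```python
import json
from pathlib import Path


def load_results(run_dir):
    """Load the 'results' section of the newest results*.json found under run_dir."""
    files = sorted(Path(run_dir).rglob("results*.json"), key=lambda p: p.stat().st_mtime)
    if not files:
        raise FileNotFoundError(f"no results*.json under {run_dir}")
    with files[-1].open() as fh:
        return json.load(fh)["results"]


def compare(run_a, run_b):
    """Return (task, metric, value_a, value_b) rows for numeric metrics present in both runs."""
    a, b = load_results(run_a), load_results(run_b)
    rows = []
    for task in sorted(set(a) & set(b)):
        for metric, val_a in a[task].items():
            val_b = b[task].get(metric)
            # Skip stderr entries and non-numeric fields such as aliases.
            if "stderr" in metric:
                continue
            if not isinstance(val_a, (int, float)) or not isinstance(val_b, (int, float)):
                continue
            rows.append((task, metric, val_a, val_b))
    return rows


def format_table(rows):
    """Render comparison rows as fixed-width text."""
    return "\n".join(f"{t:<40} {m:<24} {va:8.4f} {vb:8.4f}" for t, m, va, vb in rows)
```

For example, `print(format_table(compare("eval-results/llama-8b-full", "eval-results/qwen-7b-full")))`, adjusting the two paths to wherever runcrate cp placed the directories.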

Available benchmark tasks

Task	Measures
`mmlu`	Knowledge across 57 subjects
`hellaswag`	Common-sense reasoning
`arc_challenge`	Science reasoning (hard)
`truthfulqa_mc2`	Truthfulness
`winogrande`	Common-sense pronoun resolution
`gsm8k`	Grade-school math
`humaneval`	Code generation

Tips

Use --batch_size auto to find the largest batch size that fits in VRAM.
The vLLM backend (--model vllm) is usually much faster than the default Hugging Face backend (--model hf), mainly because of its continuous batching and paged KV cache.
For gated models such as Llama 3.1, authenticate on the instance first with huggingface-cli login, or set the HF_TOKEN environment variable.
Run the same tasks with the same num_fewshot across models for fair comparison.

Cleanup

runcrate instances delete eval