Run standardized benchmarks on LLMs with lm-evaluation-harness. Compare models on MMLU, HellaSwag, ARC, and more on your own GPU, using the same task list and few-shot settings for every model so that runs are reproducible.

1. Deploy a GPU instance

runcrate instances create --name eval --gpu A100 --template ubuntu-devbox
runcrate instances status eval

2. Install lm-eval-harness

runcrate ssh eval -- "pip install 'lm-eval[vllm]' vllm"

3. Run a benchmark

Evaluate Llama 3.1 8B Instruct on MMLU (5-shot):
runcrate ssh eval -- "lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1 \
  --tasks mmlu \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path /workspace/results/llama-8b-mmlu"
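
The same run can also be driven from Python on the instance. The following is a minimal sketch, assuming an lm-eval 0.4.x release that exposes `lm_eval.simple_evaluate`; the helper names `build_model_args` and `run_mmlu_5shot` are hypothetical and only illustrate how the `--model_args` string is assembled from key=value pairs.

```python
def build_model_args(pretrained, **extra):
    """Join settings into the comma-separated key=value form used by --model_args."""
    parts = [f"pretrained={pretrained}"]
    parts += [f"{key}={value}" for key, value in extra.items()]
    return ",".join(parts)


def run_mmlu_5shot(model_id="meta-llama/Llama-3.1-8B-Instruct"):
    """Run 5-shot MMLU through the vLLM backend and return the per-task results."""
    # Imported lazily so build_model_args works even where lm-eval is not installed.
    import lm_eval

    output = lm_eval.simple_evaluate(
        model="vllm",
        model_args=build_model_args(model_id, tensor_parallel_size=1),
        tasks=["mmlu"],
        num_fewshot=5,
        batch_size="auto",
    )
    return output["results"]
```

Calling `run_mmlu_5shot()` needs a GPU and, for gated models, a Hugging Face login on the instance. Unlike the CLI with `--output_path`, this sketch returns the scores in memory and leaves writing them to disk to the caller.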

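4. Run a full benchmark suite

Pass a comma-separated task list to run several benchmarks in one invocation. Note that --num_fewshot 5 applies to every task in the list: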
runcrate ssh eval -- "lm_eval --model vllm \
  --model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
  --tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2,winogrande,gsm8k \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path /workspace/results/llama-8b-full"

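5. Compare two models

Run the identical suite on a second model (here Qwen2.5 7B Instruct), writing to a separate output path so the two sets of scores can be compared: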
runcrate ssh eval -- "lm_eval --model vllm \
  --model_args pretrained=Qwen/Qwen2.5-7B-Instruct \
  --tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2,winogrande,gsm8k \
  --num_fewshot 5 \
  --batch_size auto \
  --output_path /workspace/results/qwen-7b-full"

6. Download results

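Depending on the lm-eval version, the results file may be a timestamped results_*.json inside a model-named subdirectory rather than a fixed results.json, so locate it by pattern:
runcrate ssh eval -- "find /workspace/results/llama-8b-full -name 'results*.json' -exec python -m json.tool {} \;"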
runcrate cp eval:/workspace/results/ ./eval-results/
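
Once both result sets are on your machine, a side-by-side comparison can be sketched as follows. This assumes the lm-eval 0.4.x results layout, where the JSON has a top-level "results" object keyed by task and metric names such as "acc,none"; the function names are hypothetical, and the two run directories are passed on the command line because the exact layout produced by `runcrate cp` is not specified here.

```python
import json
import sys
from pathlib import Path


def latest_results_file(run_dir):
    """Return the most recently modified results*.json anywhere under run_dir."""
    candidates = sorted(
        Path(run_dir).rglob("results*.json"), key=lambda p: p.stat().st_mtime
    )
    if not candidates:
        raise FileNotFoundError(f"no results*.json under {run_dir}")
    return candidates[-1]


def load_scores(results_path):
    """Map task -> {metric: value}, dropping stderr and non-numeric entries."""
    data = json.loads(Path(results_path).read_text())
    scores = {}
    for task, metrics in data["results"].items():
        scores[task] = {
            name: value
            for name, value in metrics.items()
            if isinstance(value, (int, float)) and "stderr" not in name
        }
    return scores


def compare(scores_a, scores_b):
    """Return (task, metric, a, b, b - a) rows for metrics present in both runs."""
    rows = []
    for task in sorted(scores_a.keys() & scores_b.keys()):
        for metric in sorted(scores_a[task].keys() & scores_b[task].keys()):
            a, b = scores_a[task][metric], scores_b[task][metric]
            rows.append((task, metric, a, b, b - a))
    return rows


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare.py <run_dir_a> <run_dir_b>
    run_a = load_scores(latest_results_file(sys.argv[1]))
    run_b = load_scores(latest_results_file(sys.argv[2]))
    for task, metric, a, b, delta in compare(run_a, run_b):
        print(f"{task:<20} {metric:<28} {a:8.4f} {b:8.4f} {delta:+8.4f}")
```

Only metrics reported by both runs are compared, so a task that failed or was omitted for one model drops out of the table instead of producing a misleading zero.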

Available benchmark tasks

Task            Measures
mmlu            Knowledge across 57 subjects
hellaswag       Common-sense reasoning
arc_challenge   Science reasoning (hard)
truthfulqa_mc2  Truthfulness
gsm8k           Grade-school math
humaneval       Code generation

Tips

  • Use --batch_size auto to find the largest batch size that fits in VRAM.
  • The vLLM backend is typically much faster than the default HuggingFace backend.
  • For gated models such as meta-llama/Llama-3.1-8B-Instruct, authenticate on the instance with huggingface-cli login before running lm_eval.
  • Run the same tasks with the same num_fewshot across models for fair comparison.
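
One way to keep the settings identical across models is to generate every command from a single definition. The following is a minimal sketch, not part of the runcrate or lm-eval tooling: the names `TASKS`, `NUM_FEWSHOT`, `MODELS`, and `lm_eval_command` are hypothetical, and the flags are the ones used in the steps above.

```python
import shlex

# Shared settings: every model is evaluated with exactly these values.
TASKS = ["mmlu", "hellaswag", "arc_challenge", "truthfulqa_mc2", "winogrande", "gsm8k"]
NUM_FEWSHOT = 5

# Run name (used for the output directory) -> Hugging Face model ID.
MODELS = {
    "llama-8b-full": "meta-llama/Llama-3.1-8B-Instruct",
    "qwen-7b-full": "Qwen/Qwen2.5-7B-Instruct",
}


def lm_eval_command(model_id, run_name):
    """Build one shell-quoted lm_eval command; only model and output path vary."""
    return shlex.join([
        "lm_eval", "--model", "vllm",
        "--model_args", f"pretrained={model_id}",
        "--tasks", ",".join(TASKS),
        "--num_fewshot", str(NUM_FEWSHOT),
        "--batch_size", "auto",
        "--output_path", f"/workspace/results/{run_name}",
    ])


for run_name, model_id in MODELS.items():
    print(lm_eval_command(model_id, run_name))
```

Each printed line can be passed to `runcrate ssh eval -- "..."`. Because the task list and few-shot count live in one place, a change to either applies to every model in the comparison.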

Cleanup

runcrate instances delete eval