Run standardized benchmarks on LLMs using lm-evaluation-harness. Compare models on MMLU, HellaSwag, ARC, and more on your own GPU, with fixed task and few-shot settings so that runs can be reproduced.
1. Deploy a GPU instance
runcrate instances create --name eval --gpu A100 --template ubuntu-devbox
runcrate instances status eval
2. Install lm-evaluation-harness
runcrate ssh eval -- "pip install lm-eval[vllm] vllm"
3. Run a benchmark
Evaluate Llama 3.1 8B on MMLU (5-shot):
runcrate ssh eval -- "lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-3.1-8B-Instruct,tensor_parallel_size=1 \
--tasks mmlu \
--num_fewshot 5 \
--batch_size auto \
--output_path /workspace/results/llama-8b-mmlu"
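To make the `--num_fewshot 5` setting concrete: in a 5-shot run, each test question is preceded by five solved examples in the same prompt. The sketch below illustrates the idea only. `build_few_shot_prompt` is a hypothetical helper, and the layout is a simplified assumption rather than the exact template lm-evaluation-harness uses for MMLU.

```python
def build_few_shot_prompt(examples, question, choices, num_fewshot=5):
    """Prepend num_fewshot solved examples to one unanswered question.

    examples: list of (question, choices, answer_letter) tuples.
    The prompt layout here is illustrative, not the harness's own template.
    """
    letters = "ABCD"

    def render(q, opts, answer=None):
        # One block per question: the stem, lettered options, then the answer slot.
        lines = [q] + [f"{letter}. {opt}" for letter, opt in zip(letters, opts)]
        lines.append("Answer:" if answer is None else f"Answer: {answer}")
        return "\n".join(lines)

    # Solved examples come first; the target question is left unanswered.
    shots = [render(q, opts, ans) for q, opts, ans in examples[:num_fewshot]]
    return "\n\n".join(shots + [render(question, choices)])
```

The model is then scored on what it predicts after the final `Answer:`. Changing the number of solved examples changes the prompt, so scores from runs with different few-shot counts are not directly comparable.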
4. Run a full benchmark suite
runcrate ssh eval -- "lm_eval --model vllm \
--model_args pretrained=meta-llama/Llama-3.1-8B-Instruct \
--tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2,winogrande,gsm8k \
--num_fewshot 5 \
--batch_size auto \
--output_path /workspace/results/llama-8b-full"
5. Compare two models
runcrate ssh eval -- "lm_eval --model vllm \
--model_args pretrained=Qwen/Qwen2.5-7B-Instruct \
--tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2,winogrande,gsm8k \
--num_fewshot 5 \
--batch_size auto \
--output_path /workspace/results/qwen-7b-full"
6. Download results
runcrate ssh eval -- "cat /workspace/results/llama-8b-full/results.json | python -m json.tool"
runcrate cp eval:/workspace/results/ ./eval-results/
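Step 5 produces a second set of scores but leaves the comparison itself to the reader. Once both result directories are downloaded, a side-by-side view can be sketched in Python. This is a minimal sketch and not part of lm-evaluation-harness: `load_scores`, `compare`, and `print_comparison` are hypothetical helper names. It assumes each results file has a top-level `results` object mapping task names to metric values (for example `acc,none`). It searches recursively for `results*.json` because some harness versions appear to write timestamped files inside a per-model subdirectory.

```python
import json
from pathlib import Path


def load_scores(results_dir):
    """Collect {task: {metric: value}} from every results*.json under results_dir."""
    scores = {}
    for path in sorted(Path(results_dir).rglob("results*.json")):
        with open(path) as f:
            data = json.load(f)
        # Assumed layout: {"results": {"<task>": {"<metric>": <number>, ...}}}
        for task, metrics in data.get("results", {}).items():
            # Keep numeric metrics only; skip stderr entries and text labels.
            # If several files report the same task, the last file read wins.
            scores[task] = {
                name: value
                for name, value in metrics.items()
                if isinstance(value, (int, float)) and "stderr" not in name
            }
    return scores


def compare(scores_a, scores_b):
    """Return rows (task, metric, a, b, b - a) for metrics present in both runs."""
    rows = []
    for task in sorted(scores_a.keys() & scores_b.keys()):
        for metric in sorted(scores_a[task].keys() & scores_b[task].keys()):
            a, b = scores_a[task][metric], scores_b[task][metric]
            rows.append((task, metric, a, b, b - a))
    return rows


def print_comparison(dir_a, dir_b):
    """Print one line per shared task and metric, with the difference b - a."""
    for task, metric, a, b, delta in compare(load_scores(dir_a), load_scores(dir_b)):
        print(f"{task:<20} {metric:<24} {a:8.4f} {b:8.4f} {delta:+8.4f}")
```

For example, `print_comparison("eval-results/llama-8b-full", "eval-results/qwen-7b-full")` would list each shared metric with the Llama score, the Qwen score, and the difference. The two paths are assumptions and should be adjusted to wherever `runcrate cp` placed the directories.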
Available benchmark tasks
| Task | Measures |
|---|---|
| mmlu | Knowledge across 57 subjects |
| hellaswag | Common-sense reasoning |
| arc_challenge | Science reasoning (hard) |
| truthfulqa_mc2 | Truthfulness |
| gsm8k | Grade-school math |
| humaneval | Code generation |
Tips
- Use `--batch_size auto` to find the largest batch size that fits in VRAM.
- The vLLM backend is significantly faster than the default HuggingFace backend.
- For gated models, authenticate with `huggingface-cli login` first.
- Run the same tasks with the same `num_fewshot` across models for fair comparison.
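One way to keep settings identical across models is to generate every command from a single template, so that only the model name and output path change. The sketch below assumes the same `lm_eval` flags used in the steps above. `build_eval_command` is a hypothetical helper, and the output directory naming is an illustrative choice.

```python
import shlex


def build_eval_command(model, tasks, num_fewshot=5, output_root="/workspace/results"):
    """Build one lm_eval command string; only the model and output path vary."""
    # Hypothetical naming scheme: lower-cased model name without the org prefix.
    run_name = model.split("/")[-1].lower()
    args = [
        "lm_eval",
        "--model", "vllm",
        "--model_args", f"pretrained={model}",
        "--tasks", ",".join(tasks),
        "--num_fewshot", str(num_fewshot),
        "--batch_size", "auto",
        "--output_path", f"{output_root}/{run_name}",
    ]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


TASKS = ["mmlu", "hellaswag", "arc_challenge"]
MODELS = ["meta-llama/Llama-3.1-8B-Instruct", "Qwen/Qwen2.5-7B-Instruct"]

# Same tasks and few-shot count for every model.
commands = [build_eval_command(model, TASKS) for model in MODELS]
for command in commands:
    print(command)
```

Each printed string can then be passed to `runcrate ssh eval -- "..."` in place of a hand-written command.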
Cleanup
runcrate instances delete eval