Instead of SSH-ing in manually to check nvidia-smi and tail logs, ask your AI agent. It can read logs, parse metrics, check system health, and estimate completion time.
“How’s training going on my A100 instance?”
The agent checks logs and GPU state:
ssh_execute — tail -20 /root/train.log
ssh_execute — nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu --format=csv,noheader
Training progress on training-run:
Epoch 3/10 | Step 4,200/14,000 | Loss: 0.3421 | LR: 1.8e-5
Epoch 3/10 | Step 4,220/14,000 | Loss: 0.3398 | LR: 1.8e-5
GPU health:
- Utilization: 96%
- Memory: 74,200 / 81,920 MiB (91%)
- Temperature: 72 °C (healthy, throttles at 83 °C)
Progress: 30% complete (4,200 / 14,000 steps).
Estimated time remaining: ~4.5 hours at the current step rate (~36 steps/min).
Loss is trending down steadily — training looks healthy.
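The progress and time estimate in the response above come from simple arithmetic on the latest log line. The following is a minimal sketch of that calculation, assuming log lines shaped like the sample shown; the regular expression and the helper names `parse_step` and `estimate_remaining_hours` are hypothetical, and the step rate of 36 steps/min is taken from the example output rather than measured.

```python
import re

# Assumed log format: "... | Step 4,200/14,000 | ..." as in the sample above.
STEP_RE = re.compile(r"Step ([\d,]+)/([\d,]+)")


def parse_step(line):
    """Return (current_step, total_steps) from one log line, or None."""
    match = STEP_RE.search(line)
    if match is None:
        return None
    current, total = (int(g.replace(",", "")) for g in match.groups())
    return current, total


def estimate_remaining_hours(current, total, steps_per_min):
    """Remaining wall-clock hours, assuming a constant step rate."""
    return (total - current) / steps_per_min / 60


line = "Epoch 3/10 | Step 4,200/14,000 | Loss: 0.3421 | LR: 1.8e-5"
current, total = parse_step(line)
progress = current / total
hours = estimate_remaining_hours(current, total, steps_per_min=36)
print(f"{progress:.0%} complete, ~{hours:.1f} h remaining")
# prints: 30% complete, ~4.5 h remaining
```

In practice the step rate would be derived from timestamps or from two log reads taken some time apart, so treat the constant-rate assumption as a rough guide.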
“Check disk space and make sure I won’t run out mid-training.”
The agent inspects storage:
ssh_execute — df -h /
ssh_execute — du -sh /workspace/checkpoints/ 2>/dev/null
ssh_execute — ls -lhrt /workspace/checkpoints/ | tail -5
Disk usage on training-run:
| Mount | Used | Available | Capacity |
|---|---|---|---|
| / | 142 GB | 58 GB | 71% |
Checkpoint directory: 89 GB across 3 checkpoints:
checkpoint-1400/ 28 GB (epoch 1)
checkpoint-2800/ 30 GB (epoch 2)
checkpoint-4200/ 31 GB (epoch 3, latest)
Each checkpoint is ~30 GB. You have 7 more epochs, which means ~210 GB of additional checkpoints — but you only have 58 GB free.
You’ll run out of disk at epoch 5: the epoch 4 checkpoint fits (leaving ~28 GB free), but the epoch 5 checkpoint does not. Options:
- Delete old checkpoints: `rm -rf /workspace/checkpoints/checkpoint-1400 /workspace/checkpoints/checkpoint-2800` (frees 58 GB)
- Keep only the two most recent checkpoints (add `--save_total_limit 2` to your training args)
- Write checkpoints to a mounted persistent volume instead
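The disk projection above can be reproduced with a few lines of arithmetic. This is a rough sketch, assuming one checkpoint is written per epoch and every checkpoint is about the same size (~30 GB here); the function names are hypothetical and not part of any Runcrate tool.

```python
def checkpoints_that_fit(free_gb, checkpoint_gb):
    """Number of additional checkpoints that fit in the free space."""
    return int(free_gb // checkpoint_gb)


def first_failing_epoch(current_epoch, free_gb, checkpoint_gb, total_epochs):
    """Epoch whose checkpoint would not fit, or None if all remaining fit.

    Assumes one checkpoint per epoch and a constant checkpoint size.
    """
    failing = current_epoch + checkpoints_that_fit(free_gb, checkpoint_gb) + 1
    return failing if failing <= total_epochs else None


# Values from the example: epoch 3 of 10 done, 58 GB free, ~30 GB each.
remaining_gb = (10 - 3) * 30
print(remaining_gb)                        # 210
print(first_failing_epoch(3, 58, 30, 10))  # 5
```

Because checkpoint sizes in the example grow slightly (28, 30, 31 GB), a real check might use the largest observed size to stay on the safe side.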
“Compare GPU utilization across all my running instances.”
The agent checks every instance:
list_instances — gets all running instances
ssh_execute on each — nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader
| Instance | GPU | Utilization | Memory | Status |
|---|---|---|---|---|
| training-run | A100 80GB | 96% | 74.2 / 80 GB | training actively |
| inference-prod | H100 80GB | 28% | 72.0 / 80 GB | serving (low traffic) |
| dev-box | RTX 4090 | 0% | 0.4 / 24 GB | idle |
training-run is fully utilized — healthy.
inference-prod is underutilized at 28% — could downgrade to A100 if traffic stays low.
dev-box is completely idle — consider terminating to save $0.35/hr.
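The comparison above boils down to parsing one CSV row per instance and bucketing the utilization figure. Below is a minimal sketch, assuming the `nvidia-smi` query shown earlier returns rows with units attached (for example `96 %, 74200 MiB, 81920 MiB`). The helper names and the 5% and 80% thresholds are illustrative assumptions, not values the agent is documented to use.

```python
def parse_gpu_line(line):
    """Parse one 'utilization, memory.used, memory.total' CSV row.

    Each field carries a unit suffix ('%' or 'MiB'), which is dropped.
    """
    util, used, total = (field.strip().split()[0] for field in line.split(","))
    return {
        "util_pct": int(util),
        "mem_used_mib": int(used),
        "mem_total_mib": int(total),
    }


def classify(util_pct, idle_below=5, busy_above=80):
    """Label an instance from GPU utilization; thresholds are illustrative."""
    if util_pct < idle_below:
        return "idle"
    if util_pct > busy_above:
        return "fully utilized"
    return "underutilized"


# Row built from the training-run figures in the first example.
stats = parse_gpu_line("96 %, 74200 MiB, 81920 MiB")
print(stats["util_pct"], classify(stats["util_pct"]))

# Utilization values from the comparison table above.
for name, util in [("training-run", 96), ("inference-prod", 28), ("dev-box", 0)]:
    print(name, classify(util))
```

A single reading is only a snapshot, so sampling utilization several times before resizing or terminating an instance would give a more reliable picture.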
| Tool | Purpose |
|---|---|
| list_instances | Find all running instances |
| ssh_execute | Check GPU stats, read logs, inspect disk usage |