Deploy a fleet of instances, run the same command on all of them, and monitor everything at once. Your AI agent handles the coordination.

“Deploy 4 RTX 4090 instances for a hyperparameter sweep.”

The agent provisions the fleet:
  1. list_gpu_types — confirms RTX 4090 availability and pricing
  2. create_instance — creates sweep-1 with RTX 4090
  3. create_instance — creates sweep-2 with RTX 4090
  4. create_instance — creates sweep-3 with RTX 4090
  5. create_instance — creates sweep-4 with RTX 4090
  6. instance_status — polls all four until running
Deployed 4 instances:
| Instance | GPU | Status | Cost |
| --- | --- | --- | --- |
| sweep-1 | RTX 4090 | running | $0.35/hr |
| sweep-2 | RTX 4090 | running | $0.35/hr |
| sweep-3 | RTX 4090 | running | $0.35/hr |
| sweep-4 | RTX 4090 | running | $0.35/hr |
Fleet burn rate: $1.40/hr. All 4 are ready for commands.
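
The polling step and the burn-rate figure above can be sketched in a few lines of Python. This is a minimal sketch, not the agent's actual implementation: `wait_until_running` and `fleet_burn_rate` are hypothetical helper names, and `get_status` stands in for whatever call returns an instance's status (for example, a wrapper around `instance_status`, whose real signature is not shown on this page).

```python
import time
from typing import Callable, Dict, Iterable


def wait_until_running(
    names: Iterable[str],
    get_status: Callable[[str], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, str]:
    """Poll every pending instance until all report 'running' or time runs out.

    get_status is a hypothetical callable supplied by the caller; it is an
    assumption standing in for the real status tool.
    """
    pending = set(names)
    statuses: Dict[str, str] = {}
    deadline = time.monotonic() + timeout_seconds
    while pending:
        for name in sorted(pending):
            statuses[name] = get_status(name)
        # Keep only the instances that are not running yet.
        pending = {name for name in pending if statuses[name] != "running"}
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still not running: {sorted(pending)}")
        sleep(poll_seconds)
    return statuses


def fleet_burn_rate(hourly_prices: Iterable[float]) -> float:
    """Sum per-instance hourly prices, rounded to cents."""
    return round(sum(hourly_prices), 2)
```

With the four instances above at $0.35/hr each, `fleet_burn_rate([0.35] * 4)` gives the $1.40/hr shown in the agent's reply.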

“Install dependencies on all of them and start training with different learning rates.”

The agent runs commands across the fleet:
  1. ssh_execute on each instance — pip install torch transformers datasets
  2. file_upload on each instance — uploads train.py
  3. ssh_execute on each instance — starts training with different args:
    • sweep-1: python train.py --lr 1e-4
    • sweep-2: python train.py --lr 5e-5
    • sweep-3: python train.py --lr 1e-5
    • sweep-4: python train.py --lr 5e-6
All 4 instances are running training with different learning rates:
| Instance | Learning Rate | Status |
| --- | --- | --- |
| sweep-1 | 1e-4 | training started |
| sweep-2 | 5e-5 | training started |
| sweep-3 | 1e-5 | training started |
| sweep-4 | 5e-6 | training started |
Check back in ~30 minutes for results, or ask me to monitor progress.
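
The fan-out above amounts to building one command per instance and running them concurrently. The sketch below illustrates that pattern under stated assumptions: `training_command` and `run_on_fleet` are hypothetical helpers, `execute` stands in for a wrapper around `ssh_execute` (its real signature is not documented here), and running the job in the background with `nohup` while logging to `/root/train.log` is an assumption inferred from the log path the agent tails later.

```python
import shlex
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Learning rates per instance, as listed in the sweep above.
LEARNING_RATES = {
    "sweep-1": "1e-4",
    "sweep-2": "5e-5",
    "sweep-3": "1e-5",
    "sweep-4": "5e-6",
}


def training_command(lr: str, log_path: str = "/root/train.log") -> str:
    """Build a shell command that starts training in the background.

    Assumption: backgrounding with nohup and redirecting output to log_path
    is one way to keep the job alive after the SSH session ends.
    """
    train = f"python train.py --lr {shlex.quote(lr)}"
    return f"nohup {train} > {shlex.quote(log_path)} 2>&1 &"


def run_on_fleet(
    commands: Dict[str, str],
    execute: Callable[[str, str], str],
) -> Dict[str, str]:
    """Run each instance's command concurrently; return outputs keyed by name.

    execute(name, command) is a hypothetical callable supplied by the caller.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        futures = {
            name: pool.submit(execute, name, command)
            for name, command in commands.items()
        }
        return {name: future.result() for name, future in futures.items()}


COMMANDS = {name: training_command(lr) for name, lr in LEARNING_RATES.items()}
```

Keeping the instance-to-learning-rate mapping in one dictionary makes it easy to match each result back to its hyperparameters when the runs finish.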

“Check training status on all instances and tear down any that are done.”

The agent checks each instance and cleans up:
  1. ssh_execute on each — tail -5 /root/train.log
  2. ssh_execute on each — nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader
| Instance | GPU Util | Last Log Line | Status |
| --- | --- | --- | --- |
| sweep-1 | 0% | Training complete. Best loss: 0.312 | done |
| sweep-2 | 0% | Training complete. Best loss: 0.287 | done |
| sweep-3 | 94% | Epoch 4/5 Step 800/1000 Loss: 0.301 | running |
| sweep-4 | 91% | Epoch 3/5 Step 620/1000 Loss: 0.445 | running |
sweep-2 (lr=5e-5) has the best loss so far at 0.287. Terminating finished instances to save money.
  1. file_download on sweep-1 and sweep-2 — downloads results
  2. delete_instance — terminates sweep-1 and sweep-2
Downloaded results from sweep-1 and sweep-2. Terminated both. Fleet cost reduced to $0.70/hr (2 remaining).
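
The teardown decision above can be expressed as a small function over the last log line of each instance. This is a rough sketch, assuming the completion message always has the form `Training complete. Best loss: <number>` as in the table; `plan_teardown` and `TeardownPlan` are hypothetical names, not part of any Runcrate API.

```python
import re
from typing import Dict, NamedTuple, Optional

# Assumption: a finished run ends its log with this exact message format.
DONE_LINE = re.compile(r"Training complete\. Best loss: (\d+(?:\.\d+)?)")


class TeardownPlan(NamedTuple):
    finished: Dict[str, float]    # instance name -> best loss, finished runs only
    best: Optional[str]           # finished instance with the lowest loss
    remaining_hourly_cost: float  # cost of the instances left running


def plan_teardown(last_log_lines: Dict[str, str], hourly_price: float) -> TeardownPlan:
    """Pick finished instances from their last log line and price the rest."""
    finished: Dict[str, float] = {}
    for name, line in last_log_lines.items():
        match = DONE_LINE.search(line)
        if match:
            finished[name] = float(match.group(1))
    best = min(finished, key=finished.get) if finished else None
    still_running = len(last_log_lines) - len(finished)
    return TeardownPlan(finished, best, round(still_running * hourly_price, 2))


plan = plan_teardown(
    {
        "sweep-1": "Training complete. Best loss: 0.312",
        "sweep-2": "Training complete. Best loss: 0.287",
        "sweep-3": "Epoch 4/5 Step 800/1000 Loss: 0.301",
        "sweep-4": "Epoch 3/5 Step 620/1000 Loss: 0.445",
    },
    hourly_price=0.35,
)
```

Note that `best` only compares finished runs. As in the workflow above, results should be downloaded from each finished instance before it is deleted, since termination discards whatever is on the instance.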

Tools used in this workflow

| Tool | Purpose |
| --- | --- |
| list_gpu_types | Check availability before bulk deployment |
| create_instance / instance_status | Deploy fleet and wait for readiness |
| file_upload / ssh_execute | Distribute code and run commands |
| file_download | Retrieve results from completed runs |
| delete_instance | Tear down finished instances |