Batch Operations with AI Agents

Deploy a fleet of instances, run the same command on all of them, and monitor everything at once. Your AI agent handles the coordination.

”Deploy 4 RTX 4090 instances for a hyperparameter sweep.”

The agent provisions the fleet:

list_gpu_types — confirms RTX 4090 availability and pricing
create_instance — creates sweep-1 with RTX 4090
create_instance — creates sweep-2 with RTX 4090
create_instance — creates sweep-3 with RTX 4090
create_instance — creates sweep-4 with RTX 4090
instance_status — polls all four until running

Deployed 4 instances:
Instance GPU Status Cost
sweep-1 RTX 4090 running $0.35/hr
sweep-2 RTX 4090 running $0.35/hr
sweep-3 RTX 4090 running $0.35/hr
sweep-4 RTX 4090 running $0.35/hr
Fleet burn rate: $1.40/hr. All 4 are ready for commands.

Instance	GPU	Status	Cost
sweep-1	RTX 4090	running	$0.35/hr
sweep-2	RTX 4090	running	$0.35/hr
sweep-3	RTX 4090	running	$0.35/hr
sweep-4	RTX 4090	running	$0.35/hr

”Install dependencies on all of them and start training with different learning rates.”

The agent runs commands across the fleet:

ssh_execute on each instance — pip install torch transformers datasets
file_upload on each instance — uploads train.py
ssh_execute on each instance — starts training with different args:
- sweep-1: python train.py --lr 1e-4
- sweep-2: python train.py --lr 5e-5
- sweep-3: python train.py --lr 1e-5
- sweep-4: python train.py --lr 5e-6

All 4 instances are running training with different learning rates:
Instance Learning Rate Status
sweep-1 1e-4 training started
sweep-2 5e-5 training started
sweep-3 1e-5 training started
sweep-4 5e-6 training started
Check back in ~30 minutes for results, or ask me to monitor progress.

Instance	Learning Rate	Status
sweep-1	1e-4	training started
sweep-2	5e-5	training started
sweep-3	1e-5	training started
sweep-4	5e-6	training started

”Check training status on all instances and tear down any that are done.”

The agent checks each instance and cleans up:

ssh_execute on each — tail -5 /root/train.log
ssh_execute on each — nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader

Instance GPU Util Last Log Line Status
sweep-1 0% Training complete. Best loss: 0.312 done
sweep-2 0% Training complete. Best loss: 0.287 done
sweep-3 94% Epoch 4/5 Step 800/1000 Loss: 0.301 running
sweep-4 91% Epoch 3/5 Step 620/1000 Loss: 0.445 running
sweep-2 (lr=5e-5) has the best loss so far at 0.287. Terminating finished instances to save money.

Instance	GPU Util	Last Log Line	Status
sweep-1	0%	`Training complete. Best loss: 0.312`	done
sweep-2	0%	`Training complete. Best loss: 0.287`	done
sweep-3	94%	`Epoch 4/5 Step 800/1000 Loss: 0.301`	running
sweep-4	91%	`Epoch 3/5 Step 620/1000 Loss: 0.445`	running

file_download on sweep-1 and sweep-2 — downloads results
delete_instance — terminates sweep-1 and sweep-2

Downloaded results from sweep-1 and sweep-2. Terminated both. Fleet cost reduced to $0.70/hr (2 remaining).

Tools used in this workflow

Tool	Purpose
`list_gpu_types`	Check availability before bulk deployment
`create_instance` / `instance_status`	Deploy fleet and wait for readiness
`file_upload` / `ssh_execute`	Distribute code and run commands
`file_download`	Retrieve results from completed runs
`delete_instance`	Tear down finished instances

​”Deploy 4 RTX 4090 instances for a hyperparameter sweep.”

​”Install dependencies on all of them and start training with different learning rates.”

​”Check training status on all instances and tear down any that are done.”

​Tools used in this workflow

”Deploy 4 RTX 4090 instances for a hyperparameter sweep.”

”Install dependencies on all of them and start training with different learning rates.”

”Check training status on all instances and tear down any that are done.”

Tools used in this workflow