Deploy and Train Models with AI Agents

Walk through a full training workflow without leaving your editor. Your AI agent handles GPU provisioning, file transfers, training execution, and cleanup.

”Deploy an A100, upload my dataset, and start training.”

The agent orchestrates the entire setup:

list_gpu_types — confirms A100 80GB availability and pricing ($1.60/hr)
create_instance — deploys an instance named fine-tune-run with an A100
instance_status — polls every 10 seconds until the instance is running
file_upload — uploads train.py to /workspace/train.py
file_upload — uploads dataset.jsonl to /workspace/dataset.jsonl
ssh_execute — runs pip install torch transformers datasets accelerate
ssh_execute — runs nohup python /workspace/train.py > /root/train.log 2>&1 &

Deployed fine-tune-run (1x A100 80GB, $1.60/hr). Uploaded train.py and dataset.jsonl. Training started — check progress with “how’s the training going?"

"How’s the training going? Show me GPU usage and the last few log lines.”

The agent checks two things in parallel:

ssh_execute — runs nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv,noheader
ssh_execute — runs tail -20 /root/train.log

GPU utilization: 97% | Memory: 68,420 / 81,920 MiB (83%)
Epoch 2/5 | Step 1240/3100 | Loss: 0.4812 | LR: 2.1e-5
Epoch 2/5 | Step 1260/3100 | Loss: 0.4756 | LR: 2.0e-5
Epoch 2/5 | Step 1280/3100 | Loss: 0.4701 | LR: 2.0e-5
Loss is decreasing steadily. At the current rate, training will finish in ~2.5 hours.

”Training is done. Download the model checkpoint and tear down the instance.”

The agent wraps up the workflow:

ssh_execute — runs ls -lh /workspace/output/ to confirm checkpoint files
file_download — downloads /workspace/output/adapter_model.safetensors
file_download — downloads /workspace/output/adapter_config.json
file_download — downloads /workspace/output/training_args.json
billing_usage — checks total spend for this session
delete_instance — terminates fine-tune-run

Downloaded 3 files (adapter weights, config, training args). Total cost for this training run: $6.40 (4 hours on A100). Instance fine-tune-run terminated. Billing stopped.

Tools used in this workflow

Tool	Purpose
`list_gpu_types`	Check GPU availability and pricing
`create_instance`	Provision the training machine
`instance_status`	Wait for deployment to complete
`file_upload`	Transfer training code and data
`ssh_execute`	Install dependencies, start training, check logs
`file_download`	Retrieve trained model artifacts
`billing_usage`	Verify session cost
`delete_instance`	Clean up and stop billing

​”Deploy an A100, upload my dataset, and start training.”

​"How’s the training going? Show me GPU usage and the last few log lines.”

​”Training is done. Download the model checkpoint and tear down the instance.”

​Tools used in this workflow

”Deploy an A100, upload my dataset, and start training.”

"How’s the training going? Show me GPU usage and the last few log lines.”

”Training is done. Download the model checkpoint and tear down the instance.”

Tools used in this workflow