Debug GPU Issues with AI Agents

When training crashes or inference slows down, your AI agent can SSH into the instance, inspect GPU state, read logs, and suggest concrete fixes.

“My training script is crashing with CUDA OOM. Can you check what’s happening?”

The agent investigates the GPU and the error:

list_instances — finds your running instances
ssh_execute — runs nvidia-smi on the target instance
ssh_execute — runs tail -50 /root/train.log to find the crash
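The memory check in the steps above can be reduced to one number per GPU. The following is a minimal sketch, assuming the agent (or you) runs `nvidia-smi` with its CSV query flags instead of reading the default table; the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical and not part of any tool described here.

```python
import subprocess

# nvidia-smi's CSV query mode prints one "used, total" line per GPU, in MiB.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one line per GPU, into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        gpus.append({"used_mib": used, "total_mib": total,
                     "percent": round(100 * used / total)})
    return gpus

def query_gpu_memory():
    """Run nvidia-smi on the current machine (requires an NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_memory(out)

# Figures taken from the example response in this guide.
print(parse_gpu_memory("39800, 40960\n"))
# [{'used_mib': 39800, 'total_mib': 40960, 'percent': 97}]
```

Parsing is kept separate from the `subprocess` call so the same function works on text returned over SSH.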

Instance: training-run (1x A100 40GB)

GPU memory: 39,800 / 40,960 MiB (97%, nearly full)

The log shows:
torch.cuda.OutOfMemoryError: CUDA out of memory.
Tried to allocate 2.00 GiB. GPU 0 has 1.12 GiB free.

You’re running out of memory during the backward pass. Three options:

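Reduce batch size from 16 to 8: roughly halves activation memory (weights and optimizer state are unaffected)

Enable gradient checkpointing: add model.gradient_checkpointing_enable() before training (available on Hugging Face Transformers models)

Upgrade GPU: deploy on an A100 80GB for 2x the VRAM at $1.60/hr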
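Halving the batch size also halves the number of samples behind each optimizer step. One common companion technique, not mentioned in the agent's response above, is gradient accumulation: process two micro-batches of 8 and average their gradients before stepping. The sketch below is a toy illustration in plain Python, assuming a one-parameter model `y = w * x` with mean-squared-error loss; it only shows that the averaged micro-batch gradients match the full-batch gradient when the micro-batches are equal in size.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean squared error for the 1-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def accumulated_grad(w, xs, ys, micro_batch):
    """Average the gradients of equal-sized micro-batches (gradient accumulation)."""
    steps = len(xs) // micro_batch
    total = 0.0
    for i in range(steps):
        lo, hi = i * micro_batch, (i + 1) * micro_batch
        # Only one micro-batch is "live" at a time; scale by 1/steps to average.
        total += mse_grad(w, xs[lo:hi], ys[lo:hi]) / steps
    return total

# Toy data: 16 samples, split into two micro-batches of 8.
xs = [float(i) for i in range(16)]
ys = [3.0 * x + 1.0 for x in xs]

full = mse_grad(0.5, xs, ys)
accumulated = accumulated_grad(0.5, xs, ys, micro_batch=8)
print(full, accumulated)  # -402.5 -402.5
```

In a real training loop the same idea means calling the optimizer step once every two backward passes; the memory saving comes from holding only 8 samples' activations at a time.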

“Check if there are any zombie processes eating GPU memory on my dev-box.”

The agent inspects processes on the GPU:

ssh_execute — runs nvidia-smi to see GPU process list
ssh_execute — runs ps aux | grep python to cross-reference

Found 3 Python processes using GPU memory on dev-box:
PID	GPU Memory	Command
1842	12,400 MiB	`python train.py` (your active run)
1203	8,200 MiB	`python eval.py` (finished 2 hours ago, still holding memory)
980	4,100 MiB	`python -m jupyter` (idle notebook kernel)
PIDs 1203 and 980 are zombie processes. Want me to kill them? That would free 12.3 GB of VRAM.

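The "would free 12.3 GB" figure is simply the sum of GPU memory held by every process other than the active run. A minimal sketch of that arithmetic follows, assuming the per-process data comes from `nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits`; the function names are hypothetical.

```python
def parse_compute_apps(csv_text):
    """Map PID -> GPU memory in MiB from 'pid, used_memory' CSV lines."""
    usage = {}
    for line in csv_text.strip().splitlines():
        pid, mib = (int(field) for field in line.split(","))
        usage[pid] = mib
    return usage

def reclaimable_mib(usage, keep_pids):
    """Sum GPU memory held by every process not in keep_pids."""
    return sum(mib for pid, mib in usage.items() if pid not in keep_pids)

# Values from the process table in this guide; PID 1842 is the active run.
sample = "1842, 12400\n1203, 8200\n980, 4100\n"
usage = parse_compute_apps(sample)
print(reclaimable_mib(usage, keep_pids={1842}))  # 12300
```

Passing the PIDs to keep, instead of the PIDs to kill, makes it harder to target the active run by accident.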

If you say yes, the agent runs:

ssh_execute — kill 1203 980
ssh_execute — nvidia-smi again to confirm memory is freed

Killed 2 zombie processes. GPU memory dropped from 24.7 GB to 12.4 GB. You now have 28.5 GB free.

“My inference server is responding slowly. What’s wrong?”

The agent checks GPU utilization, system resources, and the serving process:

get_instance — confirms instance specs
ssh_execute — runs nvidia-smi dmon -s u -c 5 (5 samples of GPU utilization)
ssh_execute — runs free -h to check RAM
ssh_execute — runs df -h / to check disk space
ssh_execute — runs tail -30 /root/vllm.log to check the serving logs
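The `nvidia-smi dmon` step above yields several samples that need to be averaged. The sketch below is one way to do that, assuming the usual `dmon` layout of `#`-prefixed header lines followed by one row per sample. The column set varies across driver versions, so the `sm` column is located by header name. The sample rows are illustrative values, not captured output, and `average_sm_utilization` is a hypothetical helper.

```python
def average_sm_utilization(dmon_text):
    """Average the 'sm' (GPU compute utilization %) column of dmon output."""
    columns, samples = [], []
    for line in dmon_text.splitlines():
        if line.startswith("#"):
            names = line.lstrip("# ").split()
            if "sm" in names:          # the header row that names the columns
                columns = names
        elif line.strip() and columns:
            value = line.split()[columns.index("sm")]
            if value.isdigit():        # skip "-" placeholders
                samples.append(int(value))
    return sum(samples) / len(samples) if samples else None

# Illustrative sample in the dmon layout (5 samples, as with -c 5).
sample = """\
# gpu     sm    mem    enc    dec
# Idx      %      %      %      %
    0     14      4      0      0
    0     16      5      0      0
    0     15      4      0      0
    0     13      3      0      0
    0     17      5      0      0
"""
print(average_sm_utilization(sample))  # 15.0
```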

Instance: inference-prod (1x H100 80GB)

GPU utilization: averaging 15% (not compute-bound)

GPU memory: 72 GB / 80 GB (healthy)

System RAM: 58 GB / 64 GB (91%, swapping likely)

Disk: 94% full — only 12 GB free

The bottleneck is system RAM, not GPU. vLLM is likely swapping to disk during request preprocessing. Two fixes:

Free disk space: rm -rf /root/.cache/pip /tmp/* (first confirm that no running process depends on files in /tmp)

Restart vLLM with --max-num-seqs 64 (down from default 256) to reduce CPU-side memory
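The RAM and disk percentages in the diagnosis above can also be computed directly on the instance. This is a minimal sketch, assuming a Linux host: it reads the `MemTotal` and `MemAvailable` fields of `/proc/meminfo` and uses the standard library's `shutil.disk_usage`. The function names and the sample values (chosen to match 58 GB used of 64 GB) are illustrative.

```python
import shutil

def memory_used_fraction(meminfo_text):
    """Fraction of RAM in use, from MemTotal and MemAvailable (values in kB)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if rest.strip():
            fields[key] = int(rest.split()[0])
    return 1 - fields["MemAvailable"] / fields["MemTotal"]

def disk_used_fraction(path="/"):
    """Fraction of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

# Illustrative /proc/meminfo excerpt: 64 GiB total, 6 GiB available.
sample = """\
MemTotal:       67108864 kB
MemFree:         1048576 kB
MemAvailable:    6291456 kB
"""
print(f"{memory_used_fraction(sample):.0%}")  # 91%
```

On the instance itself, the input would come from `open("/proc/meminfo").read()`. `MemAvailable` is used instead of `MemFree` because it accounts for page cache that the kernel can reclaim.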

Tools used in this workflow

Tool	Purpose
`list_instances`	Find the problematic instance
`get_instance`	Check instance specs and configuration
`ssh_execute`	Run `nvidia-smi`, read logs, kill processes, check system resources
