Set Up Model Serving with AI Agents

Go from zero to a live inference endpoint in a single conversation. Your AI agent provisions the GPU, installs the serving framework, starts the server, and hands you the URL.

“Set up a vLLM server with Llama 3.1 70B on an A100.”

The agent handles the full deployment:

list_gpu_types — confirms A100 80GB pricing ($1.60/hr)
create_instance — deploys llama-serve with A100 80GB
instance_status — polls until running
ssh_execute — pip install vllm

ssh_execute — starts vLLM:

nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --max-model-len 8192 \
  --port 8000 --host 0.0.0.0 > /root/vllm.log 2>&1 &

ssh_execute — sleep 30 && curl -s localhost:8000/health to verify
get_instance — retrieves the public IP
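A fixed `sleep 30` may be too short if the model weights are still loading when the check runs. A minimal sketch of polling instead, assuming `/health` answers with HTTP 200 once the server is ready (as the curl check above implies). The function names, timeout, and interval are illustrative and not part of the agent's tooling.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timed out, or non-2xx status: not ready yet.
        return False


def wait_until_healthy(url, timeout_s=600.0, interval_s=5.0,
                       check=http_ok, sleep=time.sleep, clock=time.monotonic):
    """Poll `url` until `check` succeeds or `timeout_s` elapses.

    `check`, `sleep`, and `clock` are injectable so the loop can be
    exercised without a live server.
    """
    deadline = clock() + timeout_s
    while True:
        if check(url):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Run on the instance, `wait_until_healthy("http://localhost:8000/health")` would return `True` as soon as the server responds, or `False` after ten minutes.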

Your vLLM server is live:
http://203.0.113.42:8000/v1/chat/completions
It’s OpenAI-compatible. Use it as a drop-in replacement:
from openai import OpenAI
client = OpenAI(base_url="http://203.0.113.42:8000/v1", api_key="unused")
Model: Llama 3.1 70B Instruct | GPU: A100 80GB | Cost: $1.60/hr
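To round out the client snippet, here is a minimal sketch of sending one chat request. It assumes the `openai` Python package (v1 or later) is installed on the calling machine; the helper name `ask` and the default `max_tokens` are illustrative.

```python
MODEL = "meta-llama/Llama-3.1-70B-Instruct"  # must match the --model passed to vLLM


def ask(client, prompt: str, max_tokens: int = 128) -> str:
    """Send a single-turn chat request and return the reply text.

    `client` is expected to be an `openai.OpenAI` instance whose base_url
    points at the vLLM server's /v1 path.
    """
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content


# Usage (requires `pip install openai` and a reachable server):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://203.0.113.42:8000/v1", api_key="unused")
#   print(ask(client, "What is vLLM?"))
```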

“Test the endpoint with a quick prompt and show me the latency.”

The agent sends a test request:

ssh_execute — runs a timed curl request from the instance itself:

time curl -s localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"meta-llama/Llama-3.1-70B-Instruct","messages":[{"role":"user","content":"What is vLLM?"}],"max_tokens":128}'

Response: “vLLM is a high-throughput serving engine for large language models that uses PagedAttention…”

Latency:

Time to first token: ~140ms

Total generation (128 tokens): ~4.2s

Throughput: ~30 tokens/sec

Server is healthy and performing within expected range for 70B on A100.
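`time curl` on a non-streaming request reports only total wall time. Separating time to first token from total generation time requires a streamed request. A minimal standard-library sketch follows, assuming the server emits OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`) when `"stream": true` is set. The function names are illustrative, and counting content chunks only approximates the token count.

```python
import json
import time
import urllib.request


def open_stream(base_url: str, model: str, prompt: str, max_tokens: int = 128):
    """POST a streaming chat request; the response iterates over raw SSE lines."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": True,
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req, timeout=120)


def measure_stream(lines, start, clock=time.perf_counter):
    """Return (ttft_s, total_s, chunks) for an iterable of SSE byte lines.

    `start` is the clock reading taken just before the request was sent.
    """
    first = None
    chunks = 0
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip the blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        choices = json.loads(payload).get("choices") or []
        delta = choices[0].get("delta", {}) if choices else {}
        if delta.get("content"):
            chunks += 1
            if first is None:
                first = clock()  # first chunk carrying generated text
    total = clock() - start
    ttft = None if first is None else first - start
    return ttft, total, chunks


# Usage on the instance:
#   start = time.perf_counter()
#   with open_stream("http://localhost:8000/v1",
#                    "meta-llama/Llama-3.1-70B-Instruct", "What is vLLM?") as resp:
#       ttft, total, chunks = measure_stream(resp, start)
```

Dividing `chunks` by `total` gives an approximate tokens-per-second figure comparable to the throughput line above.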

“Check GPU usage on the serving instance. Is it handling load well?”

The agent inspects the server:

ssh_execute — nvidia-smi
ssh_execute — curl -s localhost:8000/metrics | grep -E 'num_requests|gpu_cache'
ssh_execute — tail -10 /root/vllm.log
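The `/metrics` endpoint returns Prometheus text format, so it can be filtered programmatically as well as with `grep`. A minimal sketch of a parser follows. The default keywords mirror the grep pattern above; the function name is illustrative, and the exact metric names exposed may differ between vLLM versions.

```python
def parse_metrics(text: str, keywords=("num_requests", "gpu_cache")) -> dict:
    """Extract samples whose metric name contains any keyword.

    Returns {"metric_name{labels}": float_value}. Comment lines (# HELP,
    # TYPE) and samples that fail to parse are skipped.
    """
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A sample is "<name>{labels} <value>"; a timestamp may follow.
        if "}" in line:
            end = line.rindex("}") + 1
            name, rest = line[:end], line[end:].split()
        else:
            parts = line.split()
            name, rest = parts[0], parts[1:]
        bare_name = name.split("{", 1)[0]  # match on the name, not the labels
        if not rest or not any(k in bare_name for k in keywords):
            continue
        try:
            out[name] = float(rest[0])
        except ValueError:
            continue
    return out


# Usage on the instance:
#   import urllib.request
#   text = urllib.request.urlopen("http://localhost:8000/metrics").read().decode()
#   for name, value in parse_metrics(text).items():
#       print(name, value)
```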

GPU status:

Memory: 71,200 / 81,920 MiB (87% — model loaded, healthy headroom)

Utilization: 34% (moderate load)

Server metrics:

Active requests: 2

KV cache usage: 12%

Requests served (last hour): 847

The server has plenty of capacity. With the KV cache at 12%, it can handle roughly 8x the current number of concurrent requests before hitting memory pressure.
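The percentages and the ~8x figure follow from simple arithmetic, assuming KV cache usage grows roughly linearly with the number of concurrent requests of similar length (an approximation, since long prompts consume more cache). A small sketch with illustrative function names:

```python
def percent_used(used_mib: float, total_mib: float) -> int:
    """GPU memory in use, as a whole-number percentage."""
    return round(100 * used_mib / total_mib)


def headroom_factor(cache_usage: float) -> float:
    """How many times the current load fits before the KV cache is full.

    `cache_usage` is a fraction in (0, 1]. Assumes cache use scales
    linearly with the number of concurrent requests.
    """
    if not 0 < cache_usage <= 1:
        raise ValueError("cache_usage must be in (0, 1]")
    return 1 / cache_usage


print(percent_used(71_200, 81_920))      # 87
print(round(headroom_factor(0.12), 1))   # 8.3
```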

Tools used in this workflow

Tool	Purpose
`list_gpu_types`	Check GPU availability and pricing
`create_instance` / `instance_status`	Provision and wait for deployment
`ssh_execute`	Install vLLM, start server, test endpoint, monitor
`get_instance`	Retrieve the public IP for the endpoint URL
