Go from zero to a live inference endpoint in a single conversation. Your AI agent provisions the GPU, installs the serving framework, starts the server, and hands you the URL.

“Set up a vLLM server with Llama 3.1 70B on an A100.”

The agent handles the full deployment:
  1. list_gpu_types — confirms A100 80GB pricing ($1.60/hr)
  2. create_instance — deploys llama-serve with A100 80GB
  3. instance_status — polls until running
  4. ssh_execute runs pip install vllm
  5. ssh_execute — starts vLLM:
    nohup python -m vllm.entrypoints.openai.api_server \
      --model meta-llama/Llama-3.1-70B-Instruct \
      --max-model-len 8192 \
      --port 8000 --host 0.0.0.0 > /root/vllm.log 2>&1 &
    
  6. ssh_execute runs sleep 30 && curl -s localhost:8000/health to verify the server is up
  7. get_instance — retrieves the public IP
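A fixed sleep 30 may not be long enough for a 70B model to finish loading, so the health check in step 6 can also be written as a polling loop. The following is a minimal sketch, not the agent's actual implementation: the helper names http_ok and wait_until_healthy, the attempt count, and the delay are all assumptions chosen for illustration. Only the /health path and port 8000 come from the steps above.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to `url` answers with HTTP 200, False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: server not ready yet.
        return False


def wait_until_healthy(probe, attempts: int = 60, delay_s: float = 10.0, sleep=time.sleep) -> bool:
    """Call `probe()` until it returns True or `attempts` are used up.

    `probe` is any zero-argument callable returning a bool (hypothetical
    interface, chosen so the loop can be exercised without a live server).
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay_s)  # wait before the next check
    return False


# Example use on the instance (not executed here):
#   ready = wait_until_healthy(lambda: http_ok("http://localhost:8000/health"))
```

Passing the probe as a callable keeps the retry logic separate from the HTTP call, so the same loop could wrap a different readiness check.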
Your vLLM server is live:
http://203.0.113.42:8000/v1/chat/completions
It’s OpenAI-compatible. Use it as a drop-in replacement:
from openai import OpenAI
client = OpenAI(base_url="http://203.0.113.42:8000/v1", api_key="unused")
Model: Llama 3.1 70B Instruct | GPU: A100 80GB | Cost: $1.60/hr
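Because the endpoint speaks the OpenAI chat completions format, it can also be called without the openai package. The sketch below uses only the Python standard library and assumes the example address and model name shown above. The function names build_chat_request and ask are hypothetical, and the response parsing assumes the standard choices[0].message.content layout.

```python
import json
import urllib.request

BASE_URL = "http://203.0.113.42:8000/v1"  # example address from the walkthrough
MODEL = "meta-llama/Llama-3.1-70B-Instruct"


def build_chat_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions route."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str, timeout: float = 60.0) -> str:
    """Send the prompt and return the text of the first completion choice."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


# Example use (requires the server to be reachable):
#   print(ask("What is vLLM?"))
```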

“Test the endpoint with a quick prompt and show me the latency.”

The agent sends a test request:
  1. ssh_execute — runs a timed curl request from the instance itself:
    time curl -s localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model":"meta-llama/Llama-3.1-70B-Instruct","messages":[{"role":"user","content":"What is vLLM?"}],"max_tokens":128}'
    
Response: “vLLM is a high-throughput serving engine for large language models that uses PagedAttention…”
Latency:
  • Time to first token: ~140ms
  • Total generation (128 tokens): ~4.2s
  • Throughput: ~30 tokens/sec
The server is healthy and performing within the expected range for a 70B model on an A100.
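The throughput figure follows directly from the other two numbers. As a rough check, assuming the reported values (128 tokens, about 4.2 s total, about 140 ms to first token), the arithmetic can be sketched as below. The function names are illustrative, and the decode-only rate is an extra view that the report above does not include.

```python
def tokens_per_second(tokens: int, total_s: float) -> float:
    """End-to-end rate: generated tokens divided by total wall-clock time."""
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    return tokens / total_s


def decode_tokens_per_second(tokens: int, total_s: float, ttft_s: float) -> float:
    """Rate for the tokens after the first one, excluding time to first token."""
    decode_s = total_s - ttft_s
    if tokens < 2 or decode_s <= 0:
        raise ValueError("need at least 2 tokens and total_s > ttft_s")
    return (tokens - 1) / decode_s


overall = tokens_per_second(128, 4.2)              # about 30.5 tokens/sec
decode = decode_tokens_per_second(128, 4.2, 0.14)  # about 31.3 tokens/sec
print(f"overall: {overall:.1f} tok/s, decode only: {decode:.1f} tok/s")
```

The two rates are close here because time to first token is a small share of the total. For short completions the gap would be larger.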

“Check GPU usage on the serving instance. Is it handling load well?”

The agent inspects the server:
  1. ssh_execute runs nvidia-smi
  2. ssh_execute runs curl -s localhost:8000/metrics | grep -E 'num_requests|gpu_cache'
  3. ssh_execute runs tail -10 /root/vllm.log
GPU status:
  • Memory: 71,200 / 81,920 MiB (87% — model loaded, healthy headroom)
  • Utilization: 34% (moderate load)
Server metrics:
  • Active requests: 2
  • KV cache usage: 12%
  • Requests served (last hour): 847
The server has plenty of capacity. With KV cache usage at 12%, it can handle roughly 8x the current number of concurrent requests (1 / 0.12 ≈ 8.3) before hitting memory pressure, assuming similar request lengths.
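The headroom estimate can be reproduced from the /metrics output. The sketch below parses Prometheus-style text and divides 1 by the cache usage fraction. The SAMPLE text is illustrative rather than captured from a real server, the metric names are assumptions based on the grep pattern above and may differ between vLLM versions, and the estimate assumes KV cache usage grows roughly linearly with concurrent requests.

```python
# Illustrative metrics text (not real server output); names may vary by vLLM version.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running on GPU.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-70B-Instruct"} 2.0
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-70B-Instruct"} 0.12
"""


def parse_metrics(text: str) -> dict[str, float]:
    """Map metric name (labels stripped) to value, skipping comment lines.

    Assumes each sample line ends with the value and carries no timestamp.
    """
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, raw = line.rpartition(" ")
        name = series.split("{", 1)[0]
        try:
            values[name] = float(raw)
        except ValueError:
            continue  # ignore lines that do not end in a number
    return values


def concurrency_headroom(cache_usage: float) -> float:
    """Approximate multiple of the current load that fits in the KV cache."""
    if not 0 < cache_usage <= 1:
        raise ValueError("cache_usage must be a fraction in (0, 1]")
    return 1.0 / cache_usage


metrics = parse_metrics(SAMPLE)
usage = metrics["vllm:gpu_cache_usage_perc"]
print(f"active requests: {metrics['vllm:num_requests_running']:.0f}")
print(f"estimated headroom: {concurrency_headroom(usage):.1f}x current load")
```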

Tools used in this workflow

Tool | Purpose
list_gpu_types | Check GPU availability and pricing
create_instance / instance_status | Provision and wait for deployment
ssh_execute | Install vLLM, start server, test endpoint, monitor
get_instance | Retrieve the public IP for the endpoint URL