Serve any open-source LLM behind an OpenAI-compatible endpoint on your own GPU. This guide uses vLLM, the production standard for LLM serving — the same engine Stripe uses to process 50M+ daily API calls.
What you’ll build
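A self-hosted inference API that serves Llama 3.1 70B (or any model) on A100/H100 GPUs, accessible from anywhere via a public IP. At 16-bit precision the 70B weights alone take roughly 140 GB, so this model needs at least two 80 GB GPUs with tensor parallelism; a smaller or quantized model fits on a single card. You can point your existing OpenAI SDK code at it.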
Why vLLM
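vLLM uses PagedAttention to manage GPU memory efficiently: the KV cache is allocated in small fixed-size blocks as each request grows, rather than as one contiguous region reserved per request at the maximum sequence length. On an 80GB H100 running a 7B FP16 model, this can mean serving 100+ concurrent requests instead of ~30. The V1 engine (default since v0.8.0) uses chunked prefill by default, which keeps long prompts from blocking in-flight requests.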
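The gap between ~30 and 100+ requests can be reproduced with back-of-the-envelope arithmetic. The sketch below is illustrative only. It assumes a Llama-2-7B-like shape (32 layers, 32 KV heads, head dimension 128), 14 GB of FP16 weights, 90% of an 80 GB card usable, a 4096-token reservation per request without paging, and an average of 1024 tokens actually held per request with paging. The function names are hypothetical and none of these figures come from vLLM itself.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache that one token occupies across all layers."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_requests(cache_bytes, tokens_held_per_request, bytes_per_token):
    """How many requests fit if each one holds this many tokens of cache."""
    return cache_bytes // (tokens_held_per_request * bytes_per_token)


GB = 10**9
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
cache_budget = 80 * GB * 9 // 10 - 14 * GB  # assumed usable memory minus weights

# Without paging: every request reserves the full 4096-token window up front.
reserved_up_front = max_requests(cache_budget, 4096, per_token)
# With paging: a request only holds blocks for the tokens it has produced so far.
allocated_on_demand = max_requests(cache_budget, 1024, per_token)

print(per_token, reserved_up_front, allocated_on_demand)
```

Under these assumptions each token costs 512 KiB of cache, and the two strategies fit 27 and 108 requests respectively. Real capacity depends on the model, the context lengths in your traffic, and the memory vLLM sets aside for activations.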
Option A: CLI
1. Deploy the instance
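# Llama 3.1 70B at 16-bit needs ~140 GB for weights alone, so request two GPUs (assuming 80 GB cards)
runcrate instances create --name llm-server --gpu A100 --gpu-count 2
Wait for it to deploy:
runcrate instances status llm-server
2. Install vLLM and start the server
runcrate ssh llm-server -- "pip install vllm"
# The meta-llama repositories are gated on Hugging Face: replace <HF_TOKEN> with a token that has been granted access
runcrate ssh llm-server -- "HF_TOKEN=<HF_TOKEN> nohup python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-70B-Instruct \
--tensor-parallel-size 2 \
--max-model-len 8192 \
--port 8000 \
--host 0.0.0.0 \
> /root/vllm.log 2>&1 &"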
3. Test it
# Get the instance IP
runcrate instances info llm-server
# Hit the API
curl http://<INSTANCE_IP>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-70B-Instruct",
"messages": [{"role": "user", "content": "What is PagedAttention?"}],
"max_tokens": 256
}'
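The first request can fail with a connection error while the weights are still downloading and loading, which for a 70B model can take a while. A minimal sketch of a readiness check follows, assuming only that vLLM's OpenAI-compatible server answers GET /v1/models once the model is loaded. The helper names, the 30-minute deadline, and the 15-second interval are hypothetical choices, not part of vLLM or the Runcrate CLI.

```python
import json
import time
import urllib.request


def list_models(base_url, timeout_s=5.0):
    """Return the model IDs reported by GET {base_url}/models."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout_s) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload["data"]]


def wait_for_model(base_url, model, deadline_s=1800.0, interval_s=15.0,
                   fetch=list_models, sleep=time.sleep):
    """Poll until `model` is listed; return seconds slept, or raise TimeoutError."""
    waited = 0.0
    while waited <= deadline_s:
        try:
            if model in fetch(base_url):
                return waited
        except OSError:
            # Covers refused connections, timeouts, and urllib's URLError:
            # the server is not accepting requests yet.
            pass
        sleep(interval_s)
        waited += interval_s
    raise TimeoutError(f"{model} not served at {base_url} after {deadline_s}s")


# Usage (replace <INSTANCE_IP> with the address from `runcrate instances info`):
# wait_for_model("http://<INSTANCE_IP>:8000/v1", "meta-llama/Llama-3.1-70B-Instruct")
```

The `fetch` and `sleep` parameters exist so the loop can be exercised without a live server; in normal use the defaults are enough.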
4. Point your app at it
from openai import OpenAI
client = OpenAI(
    base_url="http://<INSTANCE_IP>:8000/v1",
    # vLLM accepts any key unless the server was started with --api-key
    api_key="not-needed",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Explain vLLM in one sentence."}],
)
print(response.choices[0].message.content)
Option B: Python SDK
from runcrate import Runcrate
import time
client = Runcrate(api_key="rc_live_...")
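# Deploy two A100s: Llama 3.1 70B at 16-bit needs ~140 GB for weights alone (assuming 80 GB cards)
instance = client.instances.create(
    name="llm-server",
    gpu_type="A100",
    gpu_count=2,
    startup_commands=[
        "pip install vllm",
        # The meta-llama repositories are gated on Hugging Face; <HF_TOKEN> must have access
        "HF_TOKEN=<HF_TOKEN> nohup python -m vllm.entrypoints.openai.api_server "
        "--model meta-llama/Llama-3.1-70B-Instruct "
        "--tensor-parallel-size 2 "
        "--max-model-len 8192 "
        "--port 8000 --host 0.0.0.0 > /root/vllm.log 2>&1 &",
    ],
)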
# Wait for deployment
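deadline = time.monotonic() + 15 * 60  # give up after 15 minutes instead of polling forever
while True:
    status = client.instances.get_status(instance.id)
    if status.status == "deployed":
        print(f"Server ready at http://{status.ip}:8000")
        break
    if time.monotonic() > deadline:
        raise TimeoutError("instance did not reach 'deployed' within 15 minutes")
    time.sleep(10)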
Option C: MCP (via Claude Code / Cursor)
“Deploy an instance with two A100s called llm-server. Once it’s ready, install vLLM and start serving Llama 3.1 70B with tensor parallelism across both GPUs on port 8000. Give me the IP when it’s up.”
Your AI assistant will:
- Call create_instance with name: "llm-server" and gpu: "A100"
- Poll instance_status until deployed
- Call ssh_execute to install vLLM and start the server
- Return the IP from get_instance
Multi-GPU serving
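For models whose weights do not fit on one GPU (70B and larger at FP16, or 405B even when quantized to FP8), use tensor parallelism across multiple GPUs and set --tensor-parallel-size to the GPU count. The FP8 405B checkpoint holds roughly 405 GB of weights, more than four 80GB H100s provide, so this example uses eight:
runcrate instances create --name llm-server-8gpu --gpu H100 --gpu-count 8
runcrate ssh llm-server-8gpu -- "pip install vllm"
# <HF_TOKEN>: a Hugging Face token with access to the gated meta-llama repositories
runcrate ssh llm-server-8gpu -- "HF_TOKEN=<HF_TOKEN> nohup python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-405B-Instruct-FP8 \
--tensor-parallel-size 8 \
--max-model-len 16384 \
--port 8000 --host 0.0.0.0 > /root/vllm.log 2>&1 &"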
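A rough way to pick the GPU count is to compare the size of the weights with the memory the cards offer. The sketch below is an estimate under stated assumptions, not a vLLM calculation: 80 GB per GPU, 90% of it usable, tensor-parallel sizes limited to 1, 2, 4, or 8, and an optional reserve for KV cache that defaults to zero. The function names are hypothetical.

```python
def weight_gb(params_billion, bytes_per_param):
    """Approximate weight size in GB: 1e9 params times bytes each, over 1e9."""
    return params_billion * bytes_per_param


def min_gpus(params_billion, bytes_per_param, gpu_gb=80, usable_fraction=0.9,
             reserve_gb=0.0, allowed=(1, 2, 4, 8)):
    """Smallest allowed GPU count whose usable memory covers weights plus reserve."""
    need = weight_gb(params_billion, bytes_per_param) + reserve_gb
    for count in allowed:
        if count * gpu_gb * usable_fraction >= need:
            return count
    return None  # does not fit on a single node under these assumptions


print(min_gpus(8, 2))    # 8B at FP16
print(min_gpus(70, 2))   # 70B at FP16
print(min_gpus(405, 1))  # 405B at FP8
```

Under these assumptions the three calls give 1, 2, and 8 GPUs. The 70B result leaves only a few GB for KV cache, so passing a `reserve_gb` that reflects the context length and concurrency you need gives a more realistic answer.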
Monitoring
# Check GPU memory and utilization
runcrate ssh llm-server -- nvidia-smi
# Check vLLM logs
runcrate ssh llm-server -- "tail -50 /root/vllm.log"
# Check running and waiting request counts (vLLM metric names start with vllm:)
runcrate ssh llm-server -- "curl -s localhost:8000/metrics | grep vllm:num_requests"
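To track these counts from a script rather than by eye, the /metrics output can be parsed as Prometheus text. A minimal sketch follows, assuming each sample line ends with its value and carries no trailing timestamp. The sample text, including its label names and values, is made up for illustration, and `read_gauges` is a hypothetical helper.

```python
def read_gauges(metrics_text, prefix="vllm:num_requests"):
    """Sum the samples of every metric whose name starts with `prefix`."""
    totals = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the {label="..."} part
        if name.startswith(prefix):
            totals[name] = totals.get(name, 0.0) + float(value)
    return totals


# Illustrative sample only; real output comes from curl -s localhost:8000/metrics
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-70B-Instruct"} 3.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-70B-Instruct"} 1.0
"""

print(read_gauges(SAMPLE))
```

A waiting count that stays above zero suggests the server is saturated and requests are queuing for KV cache space.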
Cleanup
runcrate instances delete llm-server