Run any Llama model on a dedicated GPU — from Llama 3.1 8B on an RTX 4090 to Llama 3.1 405B across four H100s. Three deployment paths: the Models API (zero infrastructure), vLLM self-hosting (full control), and Ollama (fast prototyping).
Which Llama model to pick
| Model | Parameters | GPU | VRAM needed | Approx. cost |
|---|---|---|---|---|
| Llama 3.1 8B Instruct | 8B | RTX 4090 (24 GB) | ~16 GB (FP16) | ~$0.35/hr |
| Llama 4 Scout | 17B active (109B total MoE) | A100 80 GB | ~70 GB (quantized) | ~$1.60/hr |
| Llama 3.1 70B Instruct | 70B | A100 80 GB | ~70 GB (FP8/INT8) | ~$1.60/hr |
| Llama 3.1 70B Instruct | 70B | 2x A100 40 GB | ~35 GB each (FP8/INT8) | ~$2.40/hr |
| Llama 3.1 405B Instruct (FP8) | 405B | 4x H100 80 GB | ~50 GB each | ~$10.00/hr |
Rule of thumb: model weights need roughly 2 bytes of VRAM per parameter at FP16/BF16 (about 2 GB per billion parameters) and 1 byte per parameter at FP8/INT8, plus headroom for the KV cache. When in doubt, go one tier up; you can always downgrade later.
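The rule of thumb reduces to a one-line calculation. The sketch below is illustrative only: the helper name `estimate_weight_vram_gb` and the precision table are assumptions made for this example, and the result covers model weights only, not KV cache, activations, or framework overhead.

```python
# Bytes per parameter for common precisions (weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0}


def estimate_weight_vram_gb(params_billion: float, precision: str) -> float:
    """Approximate VRAM for the weights in GB (1 GB = 1e9 bytes).

    Hypothetical helper: ignores KV cache, activations, and runtime overhead.
    """
    return params_billion * BYTES_PER_PARAM[precision.lower()]


if __name__ == "__main__":
    print(estimate_weight_vram_gb(8, "fp16"))  # 8B at FP16 -> 16.0
    print(estimate_weight_vram_gb(70, "fp8"))  # 70B at FP8 -> 70.0
```

Compare the result against the card's VRAM and leave several GB free for the KV cache before settling on a GPU.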
Option 1: Models API (easiest — no GPU needed)
The fastest path. Hit the Runcrate Models API directly and pay per token. No instance to manage, no vLLM to install, no GPU to provision.
curl
curl https://api.runcrate.ai/v1/chat/completions \
  -H "Authorization: Bearer rc_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [
      {"role": "user", "content": "Explain mixture-of-experts in two sentences."}
    ],
    "max_tokens": 256
  }'
Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="https://api.runcrate.ai/v1",
    api_key="rc_live_YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "What is Llama 4 Scout?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
TypeScript (OpenAI SDK)
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.runcrate.ai/v1",
  apiKey: "rc_live_YOUR_API_KEY",
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  messages: [{ role: "user", content: "What is Llama 4 Scout?" }],
  max_tokens: 256,
});
console.log(response.choices[0].message.content);
Works with any model in the catalog — swap the model string and go.
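Since only the model string changes between requests, the request body can be built by a small function. This is a minimal sketch, not part of any SDK: `build_chat_request` is a hypothetical helper, and the second model ID is assumed to be available in the catalog.

```python
import json


def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Return the JSON body for POST /v1/chat/completions."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    for model in (
        "meta-llama/Llama-4-Scout-17B-16E-Instruct",
        "meta-llama/Llama-3.1-8B-Instruct",  # assumption: listed in the catalog
    ):
        print(json.dumps(build_chat_request(model, "Say hello."), indent=2))
```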
Option 2: Self-host with vLLM (full control)
Run your own OpenAI-compatible endpoint on a dedicated GPU. You control the model, the context length, the quantization, and the scaling.
Deploy Llama 3.1 8B (single RTX 4090)
runcrate instances create --name llama-8b --gpu RTX4090
Wait for deployment:
runcrate instances status llama-8b
Install vLLM and start serving:
runcrate ssh llama-8b -- "pip install vllm"
runcrate ssh llama-8b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
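The `--max-model-len` flag caps the context window, which bounds how much VRAM the KV cache of a single sequence can take. As a rough sketch, the per-token KV cache size follows from the model architecture. The numbers used here (32 layers, 8 KV heads, head dimension 128) come from the published Llama 3.1 8B configuration; the helper names are hypothetical and the estimate ignores vLLM's paging overhead.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache one token occupies: one key and one value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def kv_cache_gib(tokens: int, bytes_per_token: int) -> float:
    """KV cache size in GiB for the given number of cached tokens."""
    return tokens * bytes_per_token / 2**30


if __name__ == "__main__":
    per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
    print(per_token)                      # 131072 bytes (128 KiB) per token
    print(kv_cache_gib(8192, per_token))  # 1.0 GiB for one full 8192-token sequence
```

Doubling `--max-model-len` doubles the worst-case cache per sequence, so longer contexts leave room for fewer concurrent requests on the same card.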
Deploy Llama 3.1 70B (single A100 80 GB)
By the rule of thumb above, the BF16 weights of a 70B model are about 140 GB, so a single 80 GB card only fits an 8-bit or smaller checkpoint. If the launch below fails with an out-of-memory error, point --model at a quantized variant.
runcrate instances create --name llama-70b --gpu A100
runcrate ssh llama-70b -- "pip install vllm"
runcrate ssh llama-70b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
Deploy Llama 4 Scout (single A100 80 GB)
runcrate instances create --name llama-scout --gpu A100
runcrate ssh llama-scout -- "pip install vllm"
runcrate ssh llama-scout -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
Deploy Llama 3.1 405B FP8 (4x H100)
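The 405B model does not fit on one GPU, so vLLM shards it with tensor parallelism; --tensor-parallel-size must match --gpu-count. By the rule of thumb above, the FP8 weights alone are about 405 GB, so confirm that the instance's combined VRAM covers them plus the KV cache, and raise both values together if it does not: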
runcrate instances create --name llama-405b --gpu H100 --gpu-count 4
runcrate ssh llama-405b -- "pip install vllm"
runcrate ssh llama-405b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"
Test your endpoint
# Get the instance IP
runcrate instances info llama-70b
# Hit the API
curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "What makes Llama open-weight?"}],
    "max_tokens": 256
  }'
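The server only answers once the weights have finished loading, which can take several minutes for the larger models. A minimal readiness poll might look like the following, assuming vLLM's `/health` route on the port used above. The `probe` parameter is a hypothetical injection point so the loop can be exercised without a live server, and the attempt count and delay are arbitrary defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # connection refused, timeout, or non-2xx response


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_probe,
    attempts: int = 60,
    delay_s: float = 10.0,
) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False
```

Call `wait_until_ready("http://<INSTANCE_IP>:8000/health")` with the address reported by `runcrate instances info` before sending the first chat request.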
Point your app at it
Once the server is running, point any OpenAI-compatible SDK at your instance:
Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize the Llama 3.1 release."}],
)
print(response.choices[0].message.content)
TypeScript (OpenAI SDK)
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://<INSTANCE_IP>:8000/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-70B-Instruct",
  messages: [{ role: "user", content: "Summarize the Llama 3.1 release." }],
});
console.log(response.choices[0].message.content);
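Because the Models API and a self-hosted vLLM server speak the same protocol, the target can be chosen from configuration rather than code. The following is one possible sketch; the environment variable names `LLAMA_BASE_URL` and `LLAMA_API_KEY` are hypothetical and are not defined by Runcrate or the OpenAI SDK.

```python
import os
from typing import Mapping, NamedTuple


class Endpoint(NamedTuple):
    base_url: str
    api_key: str


def resolve_endpoint(env: Mapping[str, str] = os.environ) -> Endpoint:
    """Use the self-hosted server if LLAMA_BASE_URL is set, else the Models API."""
    base_url = env.get("LLAMA_BASE_URL", "").strip()
    if base_url:
        # Placeholder key, matching the "not-needed" value used for vLLM above.
        return Endpoint(base_url.rstrip("/"), env.get("LLAMA_API_KEY", "not-needed"))
    return Endpoint(
        "https://api.runcrate.ai/v1",
        env.get("LLAMA_API_KEY", "rc_live_YOUR_API_KEY"),
    )
```

The result plugs straight into the SDK constructor, for example `OpenAI(**resolve_endpoint()._asdict())`.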
Monitoring
# GPU memory and utilization
runcrate ssh llama-70b -- nvidia-smi
# vLLM logs
runcrate ssh llama-70b -- "tail -50 /root/vllm.log"
# Active requests
runcrate ssh llama-70b -- "curl -s localhost:8000/metrics | grep 'vllm:num_requests'"
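For dashboards or alerts, the same metrics text can be parsed instead of grepped. The sketch below assumes the server returns standard Prometheus exposition text without trailing timestamps, and that the request gauges contain `num_requests` in their names; exact metric names can vary between vLLM versions. The sample input is illustrative and was not captured from a real server.

```python
def parse_request_gauges(metrics_text: str) -> dict[str, float]:
    """Extract samples whose metric name contains 'num_requests'.

    Values for the same metric name are summed across label sets.
    """
    gauges: dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if "num_requests" in name:
            gauges[name] = gauges.get(name, 0.0) + float(value)
    return gauges


# Illustrative input only; not real server output.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-70B-Instruct"} 3.0
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-70B-Instruct"} 1.0
"""

if __name__ == "__main__":
    print(parse_request_gauges(SAMPLE))
```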
Option 3: Self-host with Ollama (simpler, quantized)
Ollama runs quantized models with a single command. Good for development and prototyping — not recommended for production throughput.
Deploy and set up
runcrate instances create --name llama-ollama --gpu RTX4090
runcrate ssh llama-ollama -- "curl -fsSL https://ollama.com/install.sh | sh"
Pull and serve a model
# Pull Llama 3.1 8B (Q4 quantized — fits easily in 24 GB)
runcrate ssh llama-ollama -- "ollama pull llama3.1:8b"
# Start the server on all interfaces
runcrate ssh llama-ollama -- "OLLAMA_HOST=0.0.0.0 nohup ollama serve > /root/ollama.log 2>&1 &"
Test it
runcrate instances info llama-ollama
curl http://<INSTANCE_IP>:11434/api/chat \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello from Ollama."}],
    "stream": false
  }'
Ollama also exposes an OpenAI-compatible endpoint at /v1/chat/completions, so you can use the same OpenAI SDK pattern:
from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is Ollama?"}],
)
print(response.choices[0].message.content)
Limitations
- Quantized models (Q4/Q5) trade quality for memory efficiency. For production accuracy, use vLLM with FP16 or FP8.
- Ollama’s serving throughput is lower than vLLM — fine for single-user development, not for concurrent production traffic.
- Larger models (70B Q4) need an A100 80 GB even with quantization.
Benchmarks
Expected throughput for each model/GPU combination with vLLM, batch size 1, 2048-token output:
| Model | GPU | Tokens/sec (output) | Time to first token |
|---|---|---|---|
| Llama 3.1 8B | RTX 4090 | ~90–110 tok/s | ~50 ms |
| Llama 3.1 8B | A100 80 GB | ~120–150 tok/s | ~35 ms |
| Llama 4 Scout | A100 80 GB | ~60–80 tok/s | ~80 ms |
| Llama 3.1 70B | A100 80 GB | ~25–35 tok/s | ~150 ms |
| Llama 3.1 70B | 2x A100 40 GB | ~20–30 tok/s | ~200 ms |
| Llama 3.1 405B FP8 | 4x H100 | ~15–25 tok/s | ~300 ms |
The figures above are single-request numbers; aggregate throughput rises with concurrency. At 8 or more concurrent requests, vLLM’s continuous batching can push aggregate throughput 3–5x higher than the single-request figures.
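One way to check the effect of concurrency on your own instance is to fire several requests at once and divide the total completion tokens by wall-clock time. This is a sketch under stated assumptions: `send` is a hypothetical callable that you supply, which issues one request and returns its completion token count, and a thread pool stands in for a real load-testing tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def aggregate_tokens_per_second(
    send: Callable[[str], int], prompts: list[str], concurrency: int
) -> float:
    """Run `send` over all prompts with `concurrency` workers; return total tok/s.

    `send` must issue one request and return its completion token count.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        total_tokens = sum(pool.map(send, prompts))
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed if elapsed > 0 else 0.0
```

With the OpenAI SDK, `send` could wrap `client.chat.completions.create(...)` and return `response.usage.completion_tokens`. Running the function at concurrency 1 and again at 8 gives a like-for-like comparison on the same prompts.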
Which approach to choose
| Approach | Best for | Cost | Setup time |
|---|---|---|---|
| Models API | Production apps, no infra to manage | Per token | 60 seconds |
| vLLM self-host | Custom serving, max throughput, data privacy | Per hour (GPU) | ~10 minutes |
| Ollama self-host | Development, prototyping, experimentation | Per hour (GPU) | ~5 minutes |
Start with the Models API if you want to ship today. Move to vLLM self-hosting when you need dedicated throughput, custom context lengths, or want to keep all data on your own infrastructure.
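To compare per-hour and per-token pricing, it helps to convert an hourly GPU rate into an effective cost per million output tokens. The arithmetic below is a sketch: it uses the approximate hourly price and throughput figures from the tables in this guide, assumes the GPU is kept fully busy, and uses a hypothetical function name. Models API per-token prices are not listed here, so look them up before comparing.

```python
def self_hosted_cost_per_million_tokens(
    gpu_cost_per_hour: float, tokens_per_second: float
) -> float:
    """Dollar cost per 1M output tokens at a sustained generation rate."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Llama 3.1 70B on one A100: ~$1.60/hr at ~30 tok/s, single request.
    print(round(self_hosted_cost_per_million_tokens(1.60, 30), 2))   # about 14.81
    # Same GPU if batching lifts aggregate throughput 4x to ~120 tok/s.
    print(round(self_hosted_cost_per_million_tokens(1.60, 120), 2))  # about 3.7
```

An idle GPU still bills by the hour, so the effective self-hosted cost rises as utilization drops.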
Cleanup
When you’re done with self-hosted instances:
runcrate instances delete llama-8b
runcrate instances delete llama-70b
runcrate instances delete llama-scout
runcrate instances delete llama-405b
runcrate instances delete llama-ollama