Run any Llama model on a dedicated GPU — from Llama 3.1 8B on an RTX 4090 to Llama 3.1 405B across four H100s. Three deployment paths: the Models API (zero infrastructure), vLLM self-hosting (full control), and Ollama (fast prototyping).

Which Llama model to pick

| Model | Parameters | GPU | VRAM needed | Approx. cost |
| --- | --- | --- | --- | --- |
| Llama 3.1 8B Instruct | 8B | RTX 4090 (24 GB) | ~16 GB (FP16) | ~$0.35/hr |
| Llama 4 Scout | 17B active (109B total MoE) | A100 80 GB | ~70 GB (quantized weights) | ~$1.60/hr |
| Llama 3.1 70B Instruct | 70B | A100 80 GB | ~70 GB (8-bit weights) | ~$1.60/hr |
| Llama 3.1 70B Instruct | 70B | 2x A100 40 GB | ~35 GB each (8-bit weights) | ~$2.40/hr |
| Llama 3.1 405B Instruct (FP8) | 405B | 4x H100 80 GB | ~50 GB each | ~$10.00/hr |

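Rule of thumb: model weights take roughly 2 bytes per parameter at FP16/BF16 and 1 byte per parameter at FP8/INT8, so an 8B model needs about 16 GB at FP16 and a 70B model about 140 GB at BF16 or 70 GB at 8-bit. Leave extra room for the KV cache and activations. When in doubt, go one tier up; you can always downgrade later.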
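The rule of thumb can be turned into a quick sizing check. The sketch below is illustrative only: the bytes-per-parameter table, the INT4 entry, the 10% headroom reserved for KV cache and activations, and the restriction to power-of-two GPU counts are assumptions of this example, not Runcrate or vLLM settings.

```python
# Bytes per parameter for common weight formats. FP16/BF16 and FP8/INT8
# follow the rule of thumb; the INT4 entry is an added assumption.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billions: float, dtype: str) -> float:
    """Approximate VRAM for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype.lower()]


def min_gpu_count(params_billions: float, dtype: str, gpu_gb: float,
                  headroom: float = 0.10, max_gpus: int = 8) -> int:
    """Smallest power-of-two GPU count whose pooled VRAM holds the weights.

    `headroom` is the fraction of each GPU kept free for KV cache and
    activations (an assumed value; tune it for your context length).
    """
    needed = weight_vram_gb(params_billions, dtype)
    count = 1
    while count <= max_gpus:
        if needed <= gpu_gb * count * (1.0 - headroom):
            return count
        count *= 2
    raise ValueError(f"{needed:.0f} GB of weights do not fit on {max_gpus} GPUs")


if __name__ == "__main__":
    for name, size, dtype, gpu_gb in [
        ("Llama 3.1 8B", 8, "fp16", 24),
        ("Llama 3.1 70B", 70, "bf16", 80),
        ("Llama 3.1 70B", 70, "fp8", 80),
    ]:
        print(f"{name} {dtype}: ~{weight_vram_gb(size, dtype):.0f} GB weights, "
              f"{min_gpu_count(size, dtype, gpu_gb)} x {gpu_gb} GB GPU(s)")
```

The returned count is a candidate value for --tensor-parallel-size when self-hosting. It covers weights only, so long context windows or heavy concurrency may still call for the next tier up.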

Option 1: Models API (easiest — no GPU needed)

The fastest path. Hit the Runcrate Models API directly and pay per token. No instance to manage, no vLLM to install, no GPU to provision.

curl

curl https://api.runcrate.ai/v1/chat/completions \
  -H "Authorization: Bearer rc_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [
      {"role": "user", "content": "Explain mixture-of-experts in two sentences."}
    ],
    "max_tokens": 256
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runcrate.ai/v1",
    api_key="rc_live_YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "What is Llama 4 Scout?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

TypeScript (OpenAI SDK)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.runcrate.ai/v1",
  apiKey: "rc_live_YOUR_API_KEY",
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  messages: [{ role: "user", content: "What is Llama 4 Scout?" }],
  max_tokens: 256,
});

console.log(response.choices[0].message.content);
Works with any model in the catalog — swap the model string and go.
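Because only the model string changes between requests, the call is easy to parameterize. A minimal standard-library sketch follows; the helper names build_chat_request and send and the RUNCRATE_API_KEY environment variable are hypothetical choices for this example, while the URL and payload shape are taken from the curl call above.

```python
import json
import os
import urllib.request
from typing import Optional

API_URL = "https://api.runcrate.ai/v1/chat/completions"


def build_chat_request(model: str, prompt: str, max_tokens: int = 256,
                       api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) a chat completion request for `model`."""
    # RUNCRATE_API_KEY is a hypothetical variable name used by this sketch.
    key = api_key or os.environ.get("RUNCRATE_API_KEY", "rc_live_YOUR_API_KEY")
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": f"Bearer {key}",
                 "Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

Calling send(build_chat_request("meta-llama/Llama-4-Scout-17B-16E-Instruct", "Hello")) would issue the same request as the curl example; passing a different catalog model ID is the only change needed to switch models.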

Option 2: Self-host with vLLM (full control)

Run your own OpenAI-compatible endpoint on a dedicated GPU. You control the model, the context length, the quantization, and the scaling.

Deploy Llama 3.1 8B (single RTX 4090)

runcrate instances create --name llama-8b --gpu RTX4090
Wait for deployment:
runcrate instances status llama-8b
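Install vLLM and start serving. The meta-llama repositories on Hugging Face are gated, so the instance needs a Hugging Face token with access to them (for example, exported as HF_TOKEN) before vLLM can download the weights: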
runcrate ssh llama-8b -- "pip install vllm"

runcrate ssh llama-8b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"

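Deploy Llama 3.1 70B (single A100 80 GB)

At BF16 the 70B weights are roughly 140 GB, so a single 80 GB card only has room for an 8-bit or smaller checkpoint. If vLLM runs out of memory loading the model ID below, substitute a pre-quantized checkpoint, or add GPUs and raise --tensor-parallel-size to match.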

runcrate instances create --name llama-70b --gpu A100

runcrate ssh llama-70b -- "pip install vllm"

runcrate ssh llama-70b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"

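Deploy Llama 4 Scout (single A100 80 GB)

Scout activates 17B parameters per token, but all 109B must be resident in VRAM. That is roughly 218 GB at BF16, so a single 80 GB card needs a quantized checkpoint; otherwise add GPUs and raise --tensor-parallel-size to match.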

runcrate instances create --name llama-scout --gpu A100

runcrate ssh llama-scout -- "pip install vllm"

runcrate ssh llama-scout -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"

Deploy Llama 3.1 405B FP8 (4x H100)

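The 405B model requires tensor parallelism across multiple GPUs. By the rule of thumb above, the FP8 weights alone are about 405 GB, so confirm that the pooled VRAM covers them; if vLLM reports out-of-memory, raise --gpu-count and --tensor-parallel-size together: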
runcrate instances create --name llama-405b --gpu H100 --gpu-count 4

runcrate ssh llama-405b -- "pip install vllm"

runcrate ssh llama-405b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"

Test your endpoint

# Get the instance IP
runcrate instances info llama-70b

# Hit the API
curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "What makes Llama open-weight?"}],
    "max_tokens": 256
  }'
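Loading the weights can take a while, and until the server is listening the curl call above will simply be refused. A small polling loop avoids guessing. This sketch assumes the vLLM OpenAI-compatible server answers GET /health with status 200 once the engine is up; the function names and the timeout and interval defaults are arbitrary choices for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(base_url: str) -> bool:
    """Return True if GET {base_url}/health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(base_url: str,
                     timeout_s: float = 900.0,
                     interval_s: float = 10.0,
                     probe: Callable[[str], bool] = http_probe,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(base_url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For example, wait_until_ready("http://<INSTANCE_IP>:8000") blocks until the server responds or 15 minutes pass. The probe and sleep parameters are injectable so the loop can be exercised without a live instance.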

Point your app at it

Once the server is running, point any OpenAI-compatible SDK at your instance:

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize the Llama 3.1 release."}],
)
print(response.choices[0].message.content)

TypeScript (OpenAI SDK)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://<INSTANCE_IP>:8000/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-70B-Instruct",
  messages: [{ role: "user", content: "Summarize the Llama 3.1 release." }],
});

console.log(response.choices[0].message.content);

Monitoring

# GPU memory and utilization
runcrate ssh llama-70b -- nvidia-smi

# vLLM logs
runcrate ssh llama-70b -- "tail -50 /root/vllm.log"

# Running and waiting requests (vLLM metric names use a "vllm:" prefix)
runcrate ssh llama-70b -- "curl -s localhost:8000/metrics | grep 'vllm:num_requests'"
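The /metrics output can also be read programmatically. The sketch below parses the Prometheus text format and sums the request gauges across label sets. It assumes the gauge names begin with vllm:num_requests_ (names can differ between vLLM versions), and the sample text is made up for illustration rather than captured from a real server.

```python
def parse_request_gauges(metrics_text: str) -> dict:
    """Sum every `vllm:num_requests_*` sample, keyed by metric name."""
    totals = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        # Skips "# HELP" / "# TYPE" comment lines and unrelated metrics.
        if not line.startswith("vllm:num_requests_"):
            continue
        # Format: name{label="..."} value  -> split on the last space.
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


# Illustrative sample only; real output contains many more series.
SAMPLE = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-70B-Instruct"} 3.0
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-70B-Instruct"} 1.0
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-70B-Instruct"} 0.42
"""

if __name__ == "__main__":
    print(parse_request_gauges(SAMPLE))
```

A steadily growing waiting count is a sign that the instance is saturated and requests are queuing.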

Option 3: Self-host with Ollama (simpler, quantized)

Ollama runs quantized models with a single command. Good for development and prototyping — not recommended for production throughput.

Deploy and set up

runcrate instances create --name llama-ollama --gpu RTX4090

runcrate ssh llama-ollama -- "curl -fsSL https://ollama.com/install.sh | sh"

Pull and serve a model

# Pull Llama 3.1 8B (Q4 quantized — fits easily in 24 GB)
runcrate ssh llama-ollama -- "ollama pull llama3.1:8b"

# Start the server on all interfaces
runcrate ssh llama-ollama -- "OLLAMA_HOST=0.0.0.0 nohup ollama serve > /root/ollama.log 2>&1 &"

Test it

runcrate instances info llama-ollama

curl http://<INSTANCE_IP>:11434/api/chat \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello from Ollama."}],
    "stream": false
  }'
Ollama also exposes an OpenAI-compatible endpoint at /v1/chat/completions, so you can use the same OpenAI SDK pattern:
from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is Ollama?"}],
)
print(response.choices[0].message.content)

Limitations

  • Quantized models (Q4/Q5) trade quality for memory efficiency. For production accuracy, use vLLM with FP16 or FP8.
  • Ollama’s serving throughput is lower than vLLM — fine for single-user development, not for concurrent production traffic.
  • Larger models still need a large card: a 70B model at Q4 is roughly 40 GB of weights, so plan on an A100 80 GB even with quantization.

Benchmarks

Expected throughput for each model/GPU combination with vLLM, batch size 1, 2048-token output:

| Model | GPU | Tokens/sec (output) | Time to first token |
| --- | --- | --- | --- |
| Llama 3.1 8B | RTX 4090 | ~90–110 tok/s | ~50 ms |
| Llama 3.1 8B | A100 80 GB | ~120–150 tok/s | ~35 ms |
| Llama 4 Scout | A100 80 GB | ~60–80 tok/s | ~80 ms |
| Llama 3.1 70B | A100 80 GB | ~25–35 tok/s | ~150 ms |
| Llama 3.1 70B | 2x A100 40 GB | ~20–30 tok/s | ~200 ms |
| Llama 3.1 405B FP8 | 4x H100 | ~15–25 tok/s | ~300 ms |

Throughput scales with concurrent requests. At 8+ concurrent requests, vLLM’s continuous batching can push aggregate throughput 3–5x higher than single-request numbers.
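To check these figures on your own instance, record a timestamp when the request is sent and another for each streamed chunk, then reduce them as sketched below. The function names are hypothetical, and the sketch treats each streamed chunk as one token, which is only an approximation since a server may pack several tokens into one chunk.

```python
def summarize_stream(request_sent: float, chunk_times: list) -> dict:
    """Compute time to first token and decode rate for one streamed reply.

    `request_sent` and `chunk_times` are seconds from a monotonic clock
    such as time.perf_counter().
    """
    if len(chunk_times) < 2:
        raise ValueError("need at least two streamed chunks to compute a rate")
    decode_s = chunk_times[-1] - chunk_times[0]
    if decode_s <= 0:
        raise ValueError("chunk timestamps must be increasing")
    return {
        "chunks": len(chunk_times),
        "ttft_ms": (chunk_times[0] - request_sent) * 1000.0,
        # Rate over the decode phase only, excluding time to first token.
        "tokens_per_s": (len(chunk_times) - 1) / decode_s,
    }


def aggregate_tokens_per_s(tokens_per_request: list, wall_clock_s: float) -> float:
    """Aggregate throughput for requests that ran concurrently."""
    if wall_clock_s <= 0:
        raise ValueError("wall_clock_s must be positive")
    return sum(tokens_per_request) / wall_clock_s
```

Comparing aggregate_tokens_per_s for a concurrent run against tokens_per_s for a single request gives the batching speedup for your workload.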

Which approach to choose

| Approach | Best for | Cost | Setup time |
| --- | --- | --- | --- |
| Models API | Production apps, no infra to manage | Per token | 60 seconds |
| vLLM self-host | Custom serving, max throughput, data privacy | Per hour (GPU) | ~10 minutes |
| Ollama self-host | Development, prototyping, experimentation | Per hour (GPU) | ~5 minutes |

Start with the Models API if you want to ship today. Move to vLLM self-hosting when you need dedicated throughput, custom context lengths, or want to keep all data on your own infrastructure.
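On cost alone, the choice comes down to whether your sustained token volume exceeds what the GPU's hourly price would buy through the API. The sketch below shows the arithmetic. The $0.80 per million tokens price is a made-up placeholder, not a Runcrate rate, so substitute the real per-token price for your model; the $1.60/hr and 30 tok/s inputs come from the tables above.

```python
def break_even_tokens_per_hour(gpu_usd_per_hour: float,
                               usd_per_million_tokens: float) -> float:
    """Tokens per hour at which a dedicated GPU costs the same as the API."""
    if usd_per_million_tokens <= 0:
        raise ValueError("usd_per_million_tokens must be positive")
    return gpu_usd_per_hour / usd_per_million_tokens * 1_000_000


def sustained_tokens_per_hour(tokens_per_s: float) -> float:
    """Tokens generated in an hour at a constant rate."""
    return tokens_per_s * 3600


if __name__ == "__main__":
    # Hypothetical API price; replace with the actual rate for your model.
    HYPOTHETICAL_USD_PER_MILLION = 0.80
    needed = break_even_tokens_per_hour(1.60, HYPOTHETICAL_USD_PER_MILLION)
    single_stream = sustained_tokens_per_hour(30)
    print(f"break-even: {needed:,.0f} tok/hr; "
          f"one stream at 30 tok/s: {single_stream:,.0f} tok/hr")
```

If the break-even volume is well above what a single request stream produces, self-hosting only pays off on price once concurrent traffic keeps the GPU busy; throughput guarantees and data locality are separate reasons to self-host.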

Cleanup

When you’re done with self-hosted instances:
runcrate instances delete llama-8b
runcrate instances delete llama-70b
runcrate instances delete llama-scout
runcrate instances delete llama-405b
runcrate instances delete llama-ollama