Deploy Llama on a Cloud GPU

Run any Llama model on a dedicated GPU — from Llama 3.1 8B on an RTX 4090 to Llama 3.1 405B across four H100s. Three deployment paths: the Models API (zero infrastructure), vLLM self-hosting (full control), and Ollama (fast prototyping).

Which Llama model to pick

Model	Parameters	GPU	VRAM needed	Approx. cost
Llama 3.1 8B Instruct	8B	RTX 4090 (24 GB)	~16 GB (FP16)	~$0.35/hr
Llama 4 Scout	17B active (109B total MoE)	A100 80 GB	~55 GB (INT4)	~$1.60/hr
Llama 3.1 70B Instruct	70B	A100 80 GB	~70 GB (FP8/INT8)	~$1.60/hr
Llama 3.1 70B Instruct	70B	2x A100 40 GB	~35 GB each (FP8/INT8)	~$2.40/hr
Llama 3.1 405B Instruct (FP8)	405B	4x H100 80 GB	~50 GB each	~$10.00/hr

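Rule of thumb: model weights take roughly 2 bytes of VRAM per parameter at FP16/BF16 and 1 byte per parameter at FP8/INT8, so an 8B model needs about 16 GB at 16-bit precision or about 8 GB at 8-bit. For mixture-of-experts models, count total parameters rather than active ones, since every expert must sit in memory. Leave headroom for the KV cache and activations. When in doubt, go one tier up; you can always downgrade later.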
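The rule of thumb can be turned into a quick sizing check. The sketch below is illustrative rather than a Runcrate tool: the function names, the INT4 entry (0.5 bytes per parameter), and the 10% headroom reserved for KV cache and activations are assumptions added here, not figures from this guide.

```python
# Approximate bytes of VRAM per parameter for the model weights alone.
# FP16/BF16 and FP8/INT8 follow the rule of thumb; INT4 is an added assumption.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_vram_gb(params_billion: float, precision: str) -> float:
    """Estimate weight memory in GB (1B params at 1 byte each is about 1 GB)."""
    return params_billion * BYTES_PER_PARAM[precision.lower()]


def fits(params_billion: float, precision: str, gpu_vram_gb: float,
         gpu_count: int = 1, headroom: float = 0.10) -> bool:
    """Check whether the weights fit after reserving a fraction of VRAM.

    The 10% headroom is a hypothetical default; long contexts need more.
    """
    usable_gb = gpu_vram_gb * gpu_count * (1.0 - headroom)
    return weight_vram_gb(params_billion, precision) <= usable_gb


if __name__ == "__main__":
    for name, params, precision, vram, count in [
        ("Llama 3.1 8B", 8, "fp16", 24, 1),
        ("Llama 3.1 70B", 70, "bf16", 80, 1),
        ("Llama 3.1 70B", 70, "fp8", 80, 1),
    ]:
        need = weight_vram_gb(params, precision)
        verdict = "fits" if fits(params, precision, vram, count) else "does not fit"
        print(f"{name} at {precision}: ~{need:.0f} GB of weights, {verdict} on {count}x {vram} GB")
```

For a mixture-of-experts model, pass the total parameter count rather than the active count.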

Option 1: Models API (easiest — no GPU needed)

The fastest path. Hit the Runcrate Models API directly and pay per token. No instance to manage, no vLLM to install, no GPU to provision.

curl

curl https://api.runcrate.ai/v1/chat/completions \
  -H "Authorization: Bearer rc_live_YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",
    "messages": [
      {"role": "user", "content": "Explain mixture-of-experts in two sentences."}
    ],
    "max_tokens": 256
  }'

Python (OpenAI SDK)

from openai import OpenAI

client = OpenAI(
    base_url="https://api.runcrate.ai/v1",
    api_key="rc_live_YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    messages=[{"role": "user", "content": "What is Llama 4 Scout?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)

TypeScript (OpenAI SDK)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.runcrate.ai/v1",
  apiKey: "rc_live_YOUR_API_KEY",
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-4-Scout-17B-16E-Instruct",
  messages: [{ role: "user", content: "What is Llama 4 Scout?" }],
  max_tokens: 256,
});

console.log(response.choices[0].message.content);

Works with any model in the catalog — swap the model string and go.

Option 2: Self-host with vLLM (full control)

Run your own OpenAI-compatible endpoint on a dedicated GPU. You control the model, the context length, the quantization, and the scaling.

Deploy Llama 3.1 8B (single RTX 4090)

runcrate instances create --name llama-8b --gpu RTX4090

Wait for deployment:

runcrate instances status llama-8b

Install vLLM and start serving:

runcrate ssh llama-8b -- "pip install vllm"

runcrate ssh llama-8b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"

Deploy Llama 3.1 70B (single A100 80 GB)

runcrate instances create --name llama-70b --gpu A100

runcrate ssh llama-70b -- "pip install vllm"

runcrate ssh llama-70b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"

Deploy Llama 4 Scout (single A100 80 GB)

runcrate instances create --name llama-scout --gpu A100

runcrate ssh llama-scout -- "pip install vllm"

runcrate ssh llama-scout -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"

Deploy Llama 3.1 405B FP8 (4x H100)

The 405B model requires tensor parallelism across multiple GPUs:

runcrate instances create --name llama-405b --gpu H100 --gpu-count 4

runcrate ssh llama-405b -- "pip install vllm"

runcrate ssh llama-405b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-405B-Instruct-FP8 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --port 8000 \
  --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"

Test your endpoint

# Get the instance IP
runcrate instances info llama-70b

# Hit the API
curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-70B-Instruct",
    "messages": [{"role": "user", "content": "What makes Llama open-weight?"}],
    "max_tokens": 256
  }'
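A server launched with nohup returns control immediately, but vLLM only starts answering once the weights are downloaded and loaded, so a request sent right away can be refused. A minimal polling sketch, assuming the server exposes vLLM's GET /health route on port 8000 (polling /v1/models would work similarly); the function names, retry count, and delay are hypothetical choices:

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or non-2xx status: not ready yet.
        return False


def wait_until_ready(url: str, probe=http_ok, attempts: int = 60,
                     delay_s: float = 10.0, sleep=time.sleep) -> bool:
    """Poll url until probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:  # no need to sleep after the last try
            sleep(delay_s)
    return False
```

Calling `wait_until_ready("http://<INSTANCE_IP>:8000/health")` before the first request returns False if the server never came up within the retry budget, in which case /root/vllm.log is the place to look.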

Point your app at it

Once the server is running, point any OpenAI-compatible SDK at your instance. The first snippet is Python; the second is the TypeScript equivalent:

from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:8000/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize the Llama 3.1 release."}],
)
print(response.choices[0].message.content)

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "http://<INSTANCE_IP>:8000/v1",
  apiKey: "not-needed",
});

const response = await client.chat.completions.create({
  model: "meta-llama/Llama-3.1-70B-Instruct",
  messages: [{ role: "user", content: "Summarize the Llama 3.1 release." }],
});

console.log(response.choices[0].message.content);

Monitoring

# GPU memory and utilization
runcrate ssh llama-70b -- nvidia-smi

# vLLM logs
runcrate ssh llama-70b -- "tail -50 /root/vllm.log"

# Active requests
runcrate ssh llama-70b -- "curl -s localhost:8000/metrics | grep 'vllm:num_requests'"
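To track these numbers from a script rather than by eye, the Prometheus text returned by /metrics can be parsed directly. A small sketch, assuming vLLM's request gauges have names containing num_requests (for example vllm:num_requests_running and vllm:num_requests_waiting in recent releases); the function name is hypothetical and the sample text is illustrative, not captured output:

```python
def parse_request_gauges(metrics_text: str) -> dict[str, float]:
    """Sum vLLM request gauges from Prometheus text exposition format."""
    gauges: dict[str, float] = {}
    for raw in metrics_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        series, _, value = line.rpartition(" ")
        name = series.split("{", 1)[0]  # drop the {label="..."} part
        if name.startswith("vllm") and "num_requests" in name:
            gauges[name] = gauges.get(name, 0.0) + float(value)
    return gauges


# Illustrative sample only; real output has many more series.
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="meta-llama/Llama-3.1-70B-Instruct"} 2.0
vllm:num_requests_waiting{model_name="meta-llama/Llama-3.1-70B-Instruct"} 1.0
vllm:gpu_cache_usage_perc{model_name="meta-llama/Llama-3.1-70B-Instruct"} 0.25
"""

if __name__ == "__main__":
    print(parse_request_gauges(SAMPLE))
```

A waiting count that stays above zero usually means requests are queuing faster than the GPU can serve them.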

Option 3: Self-host with Ollama (simpler, quantized)

Ollama runs quantized models with a single command. Good for development and prototyping — not recommended for production throughput.

Deploy and set up

runcrate instances create --name llama-ollama --gpu RTX4090

runcrate ssh llama-ollama -- "curl -fsSL https://ollama.com/install.sh | sh"

Pull and serve a model

# Pull Llama 3.1 8B (Q4 quantized — fits easily in 24 GB)
runcrate ssh llama-ollama -- "ollama pull llama3.1:8b"

# Start the server on all interfaces
runcrate ssh llama-ollama -- "OLLAMA_HOST=0.0.0.0 nohup ollama serve > /root/ollama.log 2>&1 &"

Test it

runcrate instances info llama-ollama

curl http://<INSTANCE_IP>:11434/api/chat \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "Hello from Ollama."}],
    "stream": false
  }'

Ollama also exposes an OpenAI-compatible endpoint at /v1/chat/completions, so you can use the same OpenAI SDK pattern:

from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "What is Ollama?"}],
)
print(response.choices[0].message.content)
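The two Ollama endpoints return differently shaped JSON: the native /api/chat reply carries the text under message.content, while the OpenAI-compatible route nests it under choices[0].message.content. A hypothetical helper (the name extract_reply is not part of either API) that accepts both shapes when working with raw HTTP responses:

```python
import json


def extract_reply(payload: dict) -> str:
    """Return the assistant text from either response shape."""
    if "choices" in payload:  # OpenAI-compatible /v1/chat/completions
        return payload["choices"][0]["message"]["content"]
    if "message" in payload:  # native Ollama /api/chat with "stream": false
        return payload["message"]["content"]
    raise ValueError("unrecognized chat response shape")


if __name__ == "__main__":
    # Trimmed, illustrative payloads; real responses include more fields.
    native = json.loads(
        '{"model": "llama3.1:8b", "message": {"role": "assistant", "content": "Hi"}, "done": true}'
    )
    openai_style = json.loads(
        '{"choices": [{"message": {"role": "assistant", "content": "Hi"}}]}'
    )
    print(extract_reply(native), extract_reply(openai_style))
```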

Limitations

Quantized models (Q4/Q5) trade quality for memory efficiency. For production accuracy, use vLLM with FP16 or FP8.
Ollama’s serving throughput is lower than vLLM — fine for single-user development, not for concurrent production traffic.
Larger models (70B Q4) need an A100 80 GB even with quantization.

Benchmarks

Expected throughput for each model/GPU combination with vLLM, batch size 1, 2048-token output:

Model	GPU	Tokens/sec (output)	Time to first token
Llama 3.1 8B	RTX 4090	~90–110 tok/s	~50 ms
Llama 3.1 8B	A100 80 GB	~120–150 tok/s	~35 ms
Llama 4 Scout	A100 80 GB	~60–80 tok/s	~80 ms
Llama 3.1 70B	A100 80 GB	~25–35 tok/s	~150 ms
Llama 3.1 70B	2x A100 40 GB	~20–30 tok/s	~200 ms
Llama 3.1 405B FP8	4x H100	~15–25 tok/s	~300 ms

Throughput scales with concurrent requests. At 8+ concurrent requests, vLLM’s continuous batching can push aggregate throughput 3–5x higher than single-request numbers.
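These figures translate into wall-clock time with simple arithmetic: total latency is roughly the time to first token plus output tokens divided by decode speed. The sketch below applies that to the table's numbers and to the 3–5x batching multiplier; the function names are illustrative, and the results are estimates derived from the approximate figures above, not measurements.

```python
def generation_time_s(output_tokens: int, tokens_per_s: float, ttft_ms: float) -> float:
    """Approximate end-to-end time for one request, in seconds."""
    return ttft_ms / 1000.0 + output_tokens / tokens_per_s


def aggregate_range(single_low: float, single_high: float,
                    gain_low: float = 3.0, gain_high: float = 5.0) -> tuple[float, float]:
    """Scale a single-request tok/s range by the batching gain (3-5x per the text)."""
    return single_low * gain_low, single_high * gain_high


if __name__ == "__main__":
    # Llama 3.1 70B on one A100 80 GB: ~25-35 tok/s, ~150 ms to first token.
    slow = generation_time_s(2048, 25, 150)
    fast = generation_time_s(2048, 35, 150)
    print(f"2048 tokens: {fast:.0f}-{slow:.0f} s per request")
    low, high = aggregate_range(25, 35)
    print(f"aggregate under load: ~{low:.0f}-{high:.0f} tok/s")
```

At these speeds the time to first token is a small share of the total for long outputs; decode speed dominates.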

Which approach to choose

Approach	Best for	Cost	Setup time
Models API	Production apps, no infra to manage	Per token	60 seconds
vLLM self-host	Custom serving, max throughput, data privacy	Per hour (GPU)	~10 minutes
Ollama self-host	Development, prototyping, experimentation	Per hour (GPU)	~5 minutes

Start with the Models API if you want to ship today. Move to vLLM self-hosting when you need dedicated throughput, custom context lengths, or want to keep all data on your own infrastructure.
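One way to compare per-hour and per-token pricing is to convert a GPU's hourly rate into an effective cost per million output tokens at a given sustained throughput. A rough sketch using the approximate prices and throughput figures from this guide; the function names and the utilization parameter are assumptions for illustration, and since the Models API's per-token price is not stated here, it is left as an input:

```python
def cost_per_million_tokens(price_per_hr: float, tokens_per_s: float,
                            utilization: float = 1.0) -> float:
    """Effective $ per 1M output tokens for a GPU billed by the hour.

    utilization is the fraction of each hour spent generating (hypothetical knob).
    """
    tokens_per_hr = tokens_per_s * 3600 * utilization
    return price_per_hr / tokens_per_hr * 1_000_000


def self_hosting_is_cheaper(price_per_hr: float, tokens_per_s: float,
                            api_price_per_million: float,
                            utilization: float = 1.0) -> bool:
    """Compare self-hosted cost against a per-token API price you supply."""
    return cost_per_million_tokens(price_per_hr, tokens_per_s, utilization) < api_price_per_million


if __name__ == "__main__":
    # Llama 3.1 70B on an A100 80 GB at ~$1.60/hr.
    single = cost_per_million_tokens(1.60, 30)    # one request at a time
    batched = cost_per_million_tokens(1.60, 120)  # ~4x aggregate under load
    print(f"single request: ${single:.2f} per 1M tokens")
    print(f"batched:        ${batched:.2f} per 1M tokens")
```

The effective cost falls as concurrency and utilization rise, which is why dedicated hosting tends to pay off mainly under sustained load.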

Cleanup

When you’re done with self-hosted instances:

runcrate instances delete llama-8b
runcrate instances delete llama-70b
runcrate instances delete llama-scout
runcrate instances delete llama-405b
runcrate instances delete llama-ollama
