Run Ollama on a dedicated cloud GPU instead of your local machine. You get faster inference, room for larger models, and a shared endpoint for your team.
1. Deploy and install
runcrate instances create --name ollama --gpu RTX4090
runcrate instances status ollama
runcrate ssh ollama -- "curl -fsSL https://ollama.com/install.sh | sh"
2. Start the server
runcrate ssh ollama -- "OLLAMA_HOST=0.0.0.0 nohup ollama serve > /root/ollama.log 2>&1 &"
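Because the server is started in the background, the command returns before Ollama is necessarily accepting connections. A minimal sketch of a readiness check follows, assuming the server answers a plain GET on its root URL with HTTP 200 once it is up. The helper names `server_is_up` and `wait_until_ready` are hypothetical and not part of Ollama or the runcrate CLI.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url, timeout=2.0):
    """Return True if a GET on the server root answers with HTTP 200."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx answer: not ready yet.
        return False


def wait_until_ready(probe, attempts=30, delay=1.0, sleep=time.sleep):
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

From your own machine, something like `wait_until_ready(lambda: server_is_up("http://<INSTANCE_IP>:11434"))` would block until the endpoint responds or the attempts are exhausted.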
3. Pull models
runcrate ssh ollama -- "ollama pull llama3.1:8b"
runcrate ssh ollama -- "ollama pull qwen2.5:7b"
runcrate ssh ollama -- "ollama list"
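The same check that `ollama list` performs over SSH can also be done over HTTP. The sketch below assumes the `/api/tags` endpoint returns a JSON object with a `models` array whose entries carry a `name` field. The function names are hypothetical helpers for illustration.

```python
import json
import urllib.request


def installed_models(tags_response):
    """Extract model names from a decoded /api/tags response."""
    return {m["name"] for m in tags_response.get("models", [])}


def missing_models(wanted, tags_response):
    """Return the wanted model names that the server has not pulled yet."""
    have = installed_models(tags_response)
    return [name for name in wanted if name not in have]


def fetch_tags(base_url, timeout=5.0):
    """GET /api/tags from a running Ollama server and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)
```

For example, `missing_models(["llama3.1:8b", "qwen2.5:7b"], fetch_tags("http://<INSTANCE_IP>:11434"))` should return an empty list once both pulls have finished.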
4. Test the API
runcrate instances info ollama
curl http://<INSTANCE_IP>:11434/api/chat \
  -d '{
    "model": "llama3.1:8b",
    "messages": [{"role": "user", "content": "What is Ollama?"}],
    "stream": false
  }'
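The request above sets "stream" to false, so the reply arrives as a single JSON object. With streaming enabled, `/api/chat` instead sends newline-delimited JSON chunks. The sketch below shows one way to reassemble the text, assuming each chunk carries the fragment under `message.content` and the last chunk has `done` set to true. The helper name `iter_stream_text` is hypothetical.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from NDJSON lines of a streamed /api/chat reply."""
    for raw in lines:
        if isinstance(raw, bytes):
            raw = raw.decode("utf-8")
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        text = chunk.get("message", {}).get("content", "")
        if text:
            yield text
        if chunk.get("done"):
            break  # final chunk: stop reading
```

The response object returned by `urllib.request.urlopen` iterates line by line, so it can be passed to `iter_stream_text` directly and the fragments printed as they arrive.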
5. Use the OpenAI-compatible endpoint
from openai import OpenAI

client = OpenAI(
    base_url="http://<INSTANCE_IP>:11434/v1",
    api_key="ollama",
)

response = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": "Explain LoRA in one paragraph."}],
)
print(response.choices[0].message.content)
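If the `openai` package is not available, the same endpoint can be reached with the standard library. This is a rough sketch of the request the client sends, assuming the usual OpenAI-style `/chat/completions` path under the base URL and a bearer token header. The helper name `build_chat_request` is hypothetical, and the API key value is a placeholder just as in the example above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, api_key="ollama"):
    """Build (but do not send) a POST to the OpenAI-compatible chat endpoint."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Sending it with `urllib.request.urlopen(req)` and decoding the body with `json.load` should give a dictionary whose reply text sits at `["choices"][0]["message"]["content"]`, mirroring the attribute path used with the client library.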
6. Larger models on A100
For 70B+ models, use an A100 80 GB:
runcrate instances create --name ollama-big --gpu A100
runcrate ssh ollama-big -- "curl -fsSL https://ollama.com/install.sh | sh"
runcrate ssh ollama-big -- "OLLAMA_HOST=0.0.0.0 nohup ollama serve > /root/ollama.log 2>&1 &"
runcrate ssh ollama-big -- "ollama pull llama3.1:70b"
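The GPU choice comes down to simple arithmetic: weight memory is roughly parameter count times bits per weight, divided by eight. The sketch below is a back-of-the-envelope estimate only. The 20% headroom for KV cache and runtime overhead is an assumption, real 4-bit formats use slightly more than 4 bits per weight, and the function names are hypothetical.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits_in_vram(params_billion, bits_per_weight, vram_gb, headroom=0.2):
    """Rough fit check; headroom is an assumed allowance, not a measured value."""
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1 + headroom)
    return needed <= vram_gb


print(weight_memory_gb(70, 4))    # 70B at 4-bit
print(weight_memory_gb(70, 16))   # 70B at fp16
print(fits_in_vram(70, 4, 80))    # 70B at 4-bit on an 80 GB card
```

By this estimate a 4-bit 70B model needs about 35 GB for weights, which fits on an 80 GB A100 but not on a 24 GB card, while the fp16 version of the same model (about 140 GB) would not fit on a single 80 GB GPU.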
Monitoring
runcrate ssh ollama -- nvidia-smi
runcrate ssh ollama -- "tail -20 /root/ollama.log"
runcrate ssh ollama -- "ollama ps"
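For scripted monitoring, `nvidia-smi` can emit machine-readable output, for example with `--query-gpu=memory.used,memory.total,utilization.gpu --format=csv,noheader,nounits`. The sketch below parses that output, assuming one line per GPU with three integer fields in that order. The function name and dictionary keys are hypothetical.

```python
def parse_gpu_stats(csv_text):
    """Parse 'used, total, util' lines (MiB, MiB, percent), one per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        used, total, util = (int(field.strip()) for field in line.split(","))
        stats.append({"used_mib": used, "total_mib": total, "util_pct": util})
    return stats
```

The text to parse could come from running the query through `runcrate ssh ollama -- ...` and capturing its standard output. Fields reported as unavailable by the driver are not handled here.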
Tips
- Ollama's default model tags are usually 4-bit quantized (Q4). For higher quality, pull an fp16 tag if VRAM allows.
- The first request after pulling a model is slower because Ollama loads the model into GPU memory on demand.
- For production workloads with high concurrency, use vLLM instead.
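One way to soften the slow first request is to load the model ahead of time. As a sketch, assuming `/api/chat` loads the model when it receives an empty `messages` array and that the `keep_alive` field controls how long it stays in memory, a warm-up payload could be built like this. The helper name and the 30-minute value are illustrative choices.

```python
import json


def preload_payload(model, keep_alive="30m"):
    """JSON body that asks the server to load a model without generating text."""
    return json.dumps(
        {"model": model, "messages": [], "keep_alive": keep_alive}
    ).encode("utf-8")
```

POSTing this body to `http://<INSTANCE_IP>:11434/api/chat` right after a pull should move the load cost out of the first real request.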
Cleanup
runcrate instances delete ollama
runcrate instances delete ollama-big