Deploy Qwen on a Cloud GPU

Run Qwen models on your own GPU. Qwen3 introduces a hybrid thinking mode — the model can reason step-by-step or answer directly, controlled via system prompt.

GPU requirements

Model	GPU	VRAM needed	Approx. cost
Qwen2.5-7B-Instruct	RTX 4090 (24 GB)	~14 GB	~$0.35/hr
Qwen3-32B	A100 80 GB	~64 GB	~$1.60/hr
Qwen2.5-72B-Instruct	A100 80 GB	~72 GB	~$1.60/hr
Qwen3-235B-A22B (MoE)	2x H100 80 GB	~60 GB each	~$5.00/hr

Deploy Qwen2.5 7B (RTX 4090)

runcrate instances create --name qwen-7b --gpu RTX4090
runcrate instances status qwen-7b

runcrate ssh qwen-7b -- "pip install vllm"

runcrate ssh qwen-7b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --max-model-len 8192 \
  --port 8000 --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"

Deploy Qwen3 32B (A100)

runcrate instances create --name qwen3-32b --gpu A100
runcrate ssh qwen3-32b -- "pip install vllm"

runcrate ssh qwen3-32b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-32B \
  --max-model-len 8192 \
  --port 8000 --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"

Deploy Qwen3 235B MoE (2x H100)

runcrate instances create --name qwen3-235b --gpu H100 --gpu-count 2
runcrate ssh qwen3-235b -- "pip install vllm"

runcrate ssh qwen3-235b -- "nohup python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-235B-A22B \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --trust-remote-code \
  --port 8000 --host 0.0.0.0 \
  > /root/vllm.log 2>&1 &"

Test the endpoint

runcrate instances info qwen3-32b

curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "messages": [{"role": "user", "content": "Compare transformer and mamba architectures."}],
    "max_tokens": 512
  }'

Enable thinking mode

Add /think to the system prompt for step-by-step reasoning, or /no_think for direct answers:

curl http://<INSTANCE_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-32B",
    "messages": [
      {"role": "system", "content": "/think"},
      {"role": "user", "content": "What is 27 * 43?"}
    ],
    "max_tokens": 1024
  }'

Cleanup

runcrate instances delete qwen-7b
runcrate instances delete qwen3-32b
runcrate instances delete qwen3-235b

​GPU requirements

​Deploy Qwen2.5 7B (RTX 4090)

​Deploy Qwen3 32B (A100)

​Deploy Qwen3 235B MoE (2x H100)

​Test the endpoint

​Enable thinking mode

​Cleanup