Train large models faster by distributing work across multiple GPUs on a single instance.
1. Deploy a multi-GPU instance
runcrate instances create --name multi-gpu --gpu A100 --gpu-count 4 --template ubuntu-train
runcrate instances status multi-gpu
2. Verify GPU topology
runcrate ssh multi-gpu -- "nvidia-smi"
runcrate ssh multi-gpu -- "python -c 'import torch; print(f\"GPUs: {torch.cuda.device_count()}\")'"
3. PyTorch DDP training
Upload and run your training script with torchrun:
runcrate cp ./train_ddp.py multi-gpu:/workspace/
runcrate ssh multi-gpu -- "cd /workspace && torchrun \
--nproc_per_node=4 \
--master_port=29500 \
train_ddp.py \
--batch-size 32 \
--epochs 10 \
--lr 1e-4"
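The contents of train_ddp.py are not shown here, so the following is only a sketch of what each of the four processes started by torchrun typically does first. torchrun sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables for every process. The helper names read_dist_env and shard_indices are hypothetical. The strided split resembles what PyTorch's DistributedSampler does, without shuffling or padding.

```python
import os


def read_dist_env(env=os.environ):
    """Return (rank, local_rank, world_size) as exported by torchrun.

    Falls back to single-process defaults when the variables are absent,
    so the script also runs with plain `python`.
    """
    rank = int(env.get("RANK", 0))
    local_rank = int(env.get("LOCAL_RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return rank, local_rank, world_size


def shard_indices(num_samples, rank, world_size):
    """Strided split: each rank takes every world_size-th sample, starting at its rank."""
    return list(range(rank, num_samples, world_size))


if __name__ == "__main__":
    rank, local_rank, world_size = read_dist_env()
    # Hypothetical dataset of 10 samples, just to show the partitioning.
    my_samples = shard_indices(num_samples=10, rank=rank, world_size=world_size)
    print(f"rank {rank} of {world_size} on cuda:{local_rank} handles samples {my_samples}")
```

In a real training script, LOCAL_RANK would usually select the CUDA device before torch.distributed.init_process_group is called and the model is wrapped in DistributedDataParallel.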
4. DeepSpeed ZeRO (for larger models)
runcrate ssh multi-gpu -- "pip install deepspeed"
runcrate cp ./ds_config.json multi-gpu:/workspace/
runcrate cp ./train_deepspeed.py multi-gpu:/workspace/
runcrate ssh multi-gpu -- "cd /workspace && deepspeed \
--num_gpus=4 \
train_deepspeed.py \
--deepspeed_config ds_config.json"
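The ds_config.json uploaded above is not reproduced in this guide. As a starting point, the sketch below prints a minimal ZeRO stage 3 configuration. The keys are standard DeepSpeed configuration keys, but the batch size, accumulation steps, and bf16 setting are assumptions to adjust for your model. The function name build_ds_config is hypothetical.

```python
import json


def build_ds_config(micro_batch_per_gpu=8, grad_accum_steps=4):
    """Build a minimal DeepSpeed config dict that enables ZeRO stage 3.

    The default values are illustrative assumptions, not recommendations.
    """
    return {
        # Samples processed per GPU in one forward/backward pass.
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        # Micro-batches accumulated before each optimizer step.
        "gradient_accumulation_steps": grad_accum_steps,
        # bfloat16 mixed precision, which A100 GPUs support.
        "bf16": {"enabled": True},
        # Stage 3 shards parameters, gradients, and optimizer states.
        "zero_optimization": {"stage": 3},
    }


if __name__ == "__main__":
    print(json.dumps(build_ds_config(), indent=2))
```

Saved under a name of your choice, the script can write the file with a shell redirect, for example `python make_ds_config.py > ds_config.json`, before the upload step.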
5. Monitor training
runcrate ssh multi-gpu -- nvidia-smi
runcrate ssh multi-gpu -- "nvidia-smi topo -m"
runcrate ssh multi-gpu -- "tail -30 /workspace/output/training.log"
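To spot a stalled rank quickly, it can help to reduce the nvidia-smi output to a list of idle GPUs. The sketch below assumes the output comes from `nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits`. The SAMPLE text is made up for illustration and is not a real measurement. The function names and the 10% threshold are hypothetical.

```python
def parse_gpu_stats(csv_text):
    """Parse `index, utilization, mem used, mem total` rows into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem_used, mem_total = (field.strip() for field in line.split(","))
        stats.append(
            {
                "index": int(index),
                "util_pct": int(util),
                "mem_used_mib": int(mem_used),
                "mem_total_mib": int(mem_total),
            }
        )
    return stats


def idle_gpus(stats, util_threshold=10):
    """Return the indices of GPUs whose utilization is below the threshold."""
    return [gpu["index"] for gpu in stats if gpu["util_pct"] < util_threshold]


# Illustrative sample only: four GPUs, one of them nearly idle.
SAMPLE = """\
0, 97, 31250, 40960
1, 96, 31244, 40960
2, 3, 512, 40960
3, 98, 31251, 40960
"""

if __name__ == "__main__":
    print("idle GPUs:", idle_gpus(parse_gpu_stats(SAMPLE)))
```

During a healthy DDP or DeepSpeed run, all four GPUs would normally show similar utilization. A single idle GPU often means one process has crashed or is waiting on data.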
6. Download results
runcrate cp multi-gpu:/workspace/output/ ./training-output/
Tips
- DDP scales close to linearly with GPU count for data-parallel workloads: 4 GPUs deliver roughly 3.8x the throughput of one.
- DeepSpeed ZeRO-3 shards model parameters, gradients, and optimizer states across GPUs — use it when the model does not fit on a single GPU.
- Use --gradient_accumulation_steps to simulate larger batch sizes without increasing per-GPU memory.
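The arithmetic behind the scaling and batch-size tips can be checked with a few lines. This sketch assumes that --batch-size 32 in the DDP command is the per-process batch size, which this guide does not state. The function names are hypothetical.

```python
def scaling_efficiency(speedup, num_gpus):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / num_gpus


def effective_batch_size(per_gpu_batch, num_gpus, grad_accum_steps=1):
    """Samples contributing to each optimizer step across all GPUs."""
    return per_gpu_batch * num_gpus * grad_accum_steps


if __name__ == "__main__":
    # 3.8x throughput on 4 GPUs.
    print(f"efficiency: {scaling_efficiency(3.8, 4):.0%}")
    # Per-GPU batch of 32 on 4 GPUs, without and with 4 accumulation steps.
    print("global batch:", effective_batch_size(32, 4))
    print("with accumulation:", effective_batch_size(32, 4, grad_accum_steps=4))
```

Gradient accumulation raises the effective batch size while per-GPU activation memory stays at the level of a single micro-batch. Changing the effective batch size usually means revisiting the learning rate as well.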
Cleanup
runcrate instances delete multi-gpu