
GPU Clusters

Runcrate Private Cloud deploys bare-metal GPU clusters tailored to your workload. Each cluster is single-tenant — dedicated hardware that only your team can access.

Cluster Configurations

Size    | Nodes  | GPUs (8 per node) | Best For
Starter | 16     | 128               | Fine-tuning large models, multi-node training
Growth  | 32–64  | 256–512           | Pre-training mid-size models, large-scale inference
Scale   | 64–128 | 512–1,024         | Frontier model training, massive parallel workloads
Custom  | 128+   | 1,024+            | Custom configurations for unique requirements
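
The GPU totals in the table follow from the node count at 8 GPUs per node. A minimal sketch of that arithmetic is shown below; the names `TIERS`, `total_gpus`, and `matching_tiers` are illustrative only and are not part of any Runcrate API.

```python
GPUS_PER_NODE = 8  # from the table header: 8 GPUs per node

# (tier name, minimum nodes, maximum nodes or None for open-ended),
# copied from the table. Boundaries overlap at 64 and 128 nodes.
TIERS = [
    ("Starter", 16, 16),
    ("Growth", 32, 64),
    ("Scale", 64, 128),
    ("Custom", 128, None),
]


def total_gpus(nodes: int) -> int:
    """Return the GPU count for a cluster with the given number of nodes."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return nodes * GPUS_PER_NODE


def matching_tiers(nodes: int) -> list[str]:
    """Return every listed tier whose node range contains `nodes`."""
    return [
        name
        for name, low, high in TIERS
        if nodes >= low and (high is None or nodes <= high)
    ]


if __name__ == "__main__":
    for n in (16, 48, 128):
        print(n, "nodes:", total_gpus(n), "GPUs, tiers:", matching_tiers(n))
```

Node counts that fall between the listed ranges (for example, 20 nodes) match no tier in this sketch; the table does not say how such sizes are handled.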

What’s Included

Every Private Cloud cluster includes:
  • Bare-metal servers — No virtualization overhead. Full hardware access with root.
  • NVIDIA GPUs — 8 GPUs per node with NVLink for intra-node communication.
  • InfiniBand networking — High-bandwidth, low-latency interconnect between nodes for distributed training.
  • High-performance storage — NVMe SSDs for fast data access during training.
  • Dedicated networking — Private network with no shared bandwidth.
  • 24/7 monitoring — Infrastructure health monitoring and alerting.

Cluster Architecture

Use Cases

Distributed Training

Train large language models, vision models, or multimodal models across hundreds of GPUs. The InfiniBand interconnect between nodes keeps the communication overhead of gradient synchronization low.
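
To get a rough sense of how much data gradient synchronization moves, the sketch below estimates per-GPU traffic for one all-reduce. It assumes a ring all-reduce, in which each GPU sends about 2(N-1)/N times the gradient size; the communication library may pick a different algorithm in practice. The function names and the bandwidth input are hypothetical, and the time estimate ignores latency and any overlap with compute.

```python
def ring_allreduce_bytes_per_gpu(
    num_params: int, bytes_per_param: int, num_gpus: int
) -> float:
    """Bytes each GPU sends in one ring all-reduce over all gradients.

    Assumes the ring algorithm: a reduce-scatter followed by an all-gather,
    each moving (N - 1) / N of the payload per GPU.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    payload = num_params * bytes_per_param
    return 2 * (num_gpus - 1) / num_gpus * payload


def estimated_sync_seconds(bytes_sent: float, bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of sync time (no latency, no overlap)."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return bytes_sent / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Example inputs (assumptions): 1e9 parameters, 2-byte gradients, 128 GPUs.
    sent = ring_allreduce_bytes_per_gpu(1_000_000_000, 2, 128)
    print(f"{sent / 1e9:.2f} GB sent per GPU per all-reduce")
```

Because the 2(N-1)/N factor approaches 2 as N grows, per-GPU traffic stays nearly flat when adding GPUs; the effective bandwidth of the interconnect is then what bounds synchronization time.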

Production Inference

Serve models at scale with predictable latency. Dedicated hardware means no cold starts and no resource contention.

Fine-Tuning at Scale

Run multiple fine-tuning jobs in parallel across your cluster. Full control over scheduling and resource allocation.
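
As an illustration of what controlling resource allocation can look like, here is a minimal sketch that packs several fine-tuning jobs onto 8-GPU nodes using a first-fit-decreasing heuristic. It assumes each job fits within a single node; `pack_jobs` and the job names are hypothetical, and a real cluster would normally delegate this to a scheduler such as Slurm or Kubernetes.

```python
def pack_jobs(
    jobs: dict[str, int], num_nodes: int, gpus_per_node: int = 8
) -> tuple[dict[str, int], list[str]]:
    """Assign single-node jobs to nodes, largest GPU request first.

    Returns (placement, pending): placement maps job name to node index,
    pending lists jobs that did not fit anywhere.
    """
    free = [gpus_per_node] * num_nodes  # free GPUs remaining on each node
    placement: dict[str, int] = {}
    pending: list[str] = []

    # Placing large jobs first tends to leave less fragmented capacity.
    for name, need in sorted(jobs.items(), key=lambda kv: -kv[1]):
        if need < 1 or need > gpus_per_node:
            raise ValueError(f"job {name!r} must request 1..{gpus_per_node} GPUs")
        for node, available in enumerate(free):
            if available >= need:
                free[node] -= need
                placement[name] = node
                break
        else:
            # No node had enough free GPUs for this job.
            pending.append(name)
    return placement, pending


if __name__ == "__main__":
    placed, waiting = pack_jobs({"a": 8, "b": 4, "c": 4, "d": 2}, num_nodes=2)
    print("placed:", placed)
    print("waiting:", waiting)
```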

Research

Experiment with new architectures, training techniques, and scaling laws on dedicated infrastructure without worrying about availability or spot interruptions.

Software Stack

You have full control over the software stack. Common setups include:
  • Kubernetes (managed or self-managed)
  • Slurm for HPC-style job scheduling
  • Docker / Podman for containerized workloads
  • NVIDIA NCCL for multi-GPU communication
  • DeepSpeed, Megatron-LM, or PyTorch FSDP for distributed training

Runcrate can assist with cluster setup and configuration. Managed Kubernetes and Slurm options are available.
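
For the Slurm option, a multi-node job is typically described by a batch script. The sketch below generates one with one task per GPU on 8-GPU nodes. The helper `sbatch_script`, the job name, and `train.py` are hypothetical, and the `--gpus-per-node` directive assumes a Slurm installation with GPU scheduling configured; the actual setup on a given cluster may differ.

```python
def sbatch_script(
    job_name: str, nodes: int, command: str, gpus_per_node: int = 8
) -> str:
    """Build the text of a Slurm batch script for a multi-node GPU job."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        # One task per GPU, so srun starts gpus_per_node processes per node.
        f"#SBATCH --ntasks-per-node={gpus_per_node}",
        f"#SBATCH --gpus-per-node={gpus_per_node}",
        "",
        f"srun {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "train.py" is a placeholder for your own training entry point.
    print(sbatch_script("finetune", nodes=4, command="python train.py"))
```

The generated text can be saved to a file and submitted with `sbatch`.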