Reference
What it takes to run 50 GPUs as one.
Compute, interconnect, storage, parallelism, power, failure. The vocabulary and the math behind running GPU clusters at production scale.
Compute
FP8, tensor cores, MIG.
8 terms
Interconnect
NVLink, NVSwitch, InfiniBand.
9 terms
Collectives
All-reduce, NCCL.
6 terms
Storage
Parallel filesystems, GPUDirect Storage.
7 terms
Parallelism
FSDP, ZeRO, tensor parallel.
6 terms
Power and Thermals
DLC, watts per rack, thermal stragglers.
5 terms
Failure at Scale
Stragglers, blast radius, drain-and-replace.
6 terms
Orchestration
Slurm, Kubernetes, gang scheduling.
6 terms
We built this because we operate at this scale.
Factryze runs autonomous agents across GPU fleets. If your team is hitting the problems described here, we'd like to hear about it.