Gradient Bucketing
A modern transformer is a pile of small tensors. A layer count of "70" hides the fact that the parameter-tensor count runs into the thousands once you separate every weight matrix, every bias, every layer-norm scale, every embedding slice. Synchronous data-parallel training has to all-reduce all of them, every step. The single biggest knob between "training keeps up with the math" and "training pays NCCL launch cost over and over" is how those gradient tensors get coalesced before they hit the wire.
The problem: thousands of small messages
A 70-layer transformer with attention, MLPs, and layer norms has on the order of one to two thousand parameter tensors when each weight, bias, and scale is counted separately. Without bucketing, the natural pattern in autograd is to fire one all-reduce as soon as each gradient lands, which means each step issues hundreds to thousands of separate collectives. Each one pays the NCCL launch cost, the kernel dispatch, the CUDA stream synchronization, and the α latency floor of the collective (on the order of 10 µs per call on NVLink + NVSwitch, higher across InfiniBand). Multiply by a few thousand calls and you are paying milliseconds of pure overhead per step before any useful bytes have moved.
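To make that pattern concrete, here is a minimal sketch of the fire-as-each-gradient-lands behavior, written against plain torch.distributed rather than DDP internals. It assumes a recent PyTorch (for register_post_accumulate_grad_hook) and an already-initialized process group; it is illustrative, not how DDP is implemented.

import torch.distributed as dist

def allreduce_each_grad(model):
    # Naive per-parameter pattern: one collective launched per tensor,
    # i.e. hundreds to thousands of launches per step on a large transformer.
    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(
                lambda param: dist.all_reduce(param.grad, op=dist.ReduceOp.AVG)
            )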
That is the latency-bound regime, and it is exactly where ring all-reduce is wasted. Ring is bandwidth-optimal at large message sizes; on a 4 KB gradient tensor it is paying the full ring traversal cost (2(P-1) sequential hops) for a payload that fits in a single packet. The fabric sits idle, the GPU sits idle, and step time stalls on a treadmill of tiny collectives.
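A back-of-envelope makes the scale of the overhead concrete. The 10 µs figure is the assumed per-call latency floor from above; the tensor and bucket counts are the rough numbers used later in this piece.

alpha = 10e-6             # assumed latency floor per all-reduce call, seconds
calls_unbucketed = 1500   # roughly one call per parameter tensor
calls_bucketed = 60       # roughly 1.5 GB of gradients / 25 MB buckets

print(f"unbucketed: {calls_unbucketed * alpha * 1e3:.1f} ms of pure launch/latency per step")  # ~15 ms
print(f"bucketed:   {calls_bucketed * alpha * 1e3:.2f} ms of pure launch/latency per step")    # ~0.6 ms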
How DDP buckets
PyTorch DDP solves this by coalescing gradients into fixed-size buckets. The default bucket_cap_mb=25 groups consecutive parameters in registration order until the bucket reaches 25 MB, then closes it and starts a new one. When a bucket fills, DDP fires a single all-reduce on the whole bucket. A model with 1.5 GB of gradients ends up with roughly 60 buckets, which means roughly 60 all-reduces per step instead of 1500.
The ordering choice matters. DDP registers parameters in reverse forward order, which is the order their gradients land during backward. The first bucket to fill (and therefore fire) holds the latest layers' gradients, which finish first because backward runs from output to input. The last bucket to fire holds the earliest layers' gradients. This is what makes gradient-comm overlap possible: while bucket N's all-reduce is in flight, the GPU is still computing backward for the layers feeding bucket N+1. See compute-comm overlap for the scheduling that makes this overlap actually save wall clock.
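The grouping logic itself is simple. A hedged sketch of the idea (assign_buckets is a hypothetical helper, not DDP's internal reducer code): walk parameters in roughly the order their gradients will land and close a bucket as soon as it crosses the cap.

def assign_buckets(params, cap_mb=25):
    # Hypothetical helper illustrating the grouping, not DDP internals.
    cap_bytes = cap_mb * 1024 * 1024
    buckets, current, current_bytes = [], [], 0
    for p in reversed(list(params)):      # reverse registration order ~ gradient arrival order
        current.append(p)
        current_bytes += p.numel() * p.element_size()
        if current_bytes >= cap_bytes:    # bucket is full: it can all-reduce as soon
            buckets.append(current)       # as its last gradient arrives
            current, current_bytes = [], 0
    if current:
        buckets.append(current)           # the earliest layers end up in the last bucket
    return buckets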
Why 25 MB
The default landed at 25 MB through empirical tuning on the V100 + NVLink generation, and it stuck because the trade-off it captures is fundamental, not hardware-specific. Smaller buckets (1 MB or 5 MB) keep firing in the latency-bound regime: every call still pays α, and the ring's bandwidth is mostly unused. Larger buckets (100 MB or 250 MB) waste the overlap window: a bucket cannot fire until its last gradient lands, so a 100 MB bucket sits waiting for many backward kernels to finish before any of its bytes start moving. The longer the wait, the less compute is left to overlap the all-reduce with.
25 MB hits the sweet spot on NVLink: large enough to push the call comfortably above the ring crossover (~256 KiB on 8x H100) and into the bandwidth-bound regime, small enough that backward fills the next bucket while the current one is in flight. On H100 + NVSwitch the 25 MB default still works, but it is not necessarily optimal. Some teams running very large transformers tune up to 50 or 100 MB once profiling shows the ring is not saturating its bandwidth ceiling on 25 MB calls. Speedups from sensible bucketing land in the 1.2 to 1.8× range over the unbucketed baseline, depending on parameter distribution and fabric.
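One way to see the trade-off in numbers is the standard ring cost model, T(n) ≈ α + 2n(P-1)/(P·B). Treating the ~10 µs figure as a fixed per-call floor and assuming roughly 400 GB/s of per-GPU bus bandwidth on 8 ranks (both assumed, illustrative values, not measurements), the latency share of each call falls off quickly with bucket size:

P, alpha, B = 8, 10e-6, 400e9   # ranks, assumed per-call latency floor, assumed bytes/s per GPU

for mb in (1, 25, 250):
    n = mb * 1024 * 1024
    t_bw = 2 * n * (P - 1) / (P * B)   # bandwidth term of a ring all-reduce
    share = alpha / (alpha + t_bw)
    print(f"{mb:>3} MB bucket: {t_bw * 1e6:7.0f} us moving bytes, latency is {share:.0%} of the call")

Under those assumptions a 1 MB bucket spends most of its call on latency, a 25 MB bucket spends under a tenth of it there, and larger buckets gain little more on this axis while eating the overlap window.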
What you can tune
The DDP knob is one constructor argument:
from torch.nn.parallel import DistributedDataParallel

model = DistributedDataParallel(
    model,
    bucket_cap_mb=50,  # default 25
)

Profile first, change second. NCCL_DEBUG=INFO plus a step-time trace will tell you whether buckets are firing in the bandwidth-bound regime. If they are, leave it alone. If the ring is consistently underused at 25 MB on a model with very large layers, 50 MB is a reasonable next step.
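If profiling does suggest a change, the cheapest experiment is a direct sweep. A hedged sketch, where build_model, batches, and train_step are placeholders for your own setup and the process group is assumed to be initialized already:

import time
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def time_steps(cap_mb, warmup=5, iters=20):
    model = DDP(build_model().cuda(), bucket_cap_mb=cap_mb)   # build_model: placeholder
    for i, batch in enumerate(batches[: warmup + iters]):     # batches: placeholder loader
        if i == warmup:
            torch.cuda.synchronize()
            start = time.perf_counter()
        train_step(model, batch)                              # train_step: placeholder fwd/bwd/opt
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

for cap in (13, 25, 50, 100):
    print(f"bucket_cap_mb={cap}: {time_steps(cap):.3f} s/step")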
The other DDP knob is find_unused_parameters=True, which is a different beast. It exists for models that conditionally skip parameters per step (early-exit, MoE, multi-task heads): without it, a bucket containing a parameter that never receives a gradient never fires, and DDP stalls or errors out waiting for it. With the flag on, DDP walks the autograd graph from the forward outputs every iteration and preemptively marks the parameters that will not get gradients as ready to reduce. That traversal is pure per-step overhead. It is a foot-gun on any model that does not actually need it: turning it on for safety on a vanilla transformer adds cost every step and buys nothing.
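For reference, a minimal sketch of a model that genuinely needs the flag: only one head participates in a given forward pass, so the other head's parameters receive no gradient that step. The module is a toy example and an initialized process group is assumed.

import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

class TwoHeads(nn.Module):
    # Toy example: one of the two heads is skipped on every step.
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(512, 512)
        self.head_a = nn.Linear(512, 10)
        self.head_b = nn.Linear(512, 10)

    def forward(self, x, use_a: bool):
        h = self.body(x)
        return self.head_a(h) if use_a else self.head_b(h)

model = DDP(TwoHeads().cuda(), find_unused_parameters=True)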
FSDP does not bucket the same way. With FSDP each layer (or wrapped unit) issues its own all-gather of parameters during forward and its own reduce-scatter of gradients during backward. The wrap policy is the FSDP equivalent of bucket_cap_mb: a small wrap (per-layer) means many small all-gathers and reduce-scatters, a large wrap (per transformer block, or per N blocks) means fewer, larger ones. A per-layer or per-block wrap via an auto-wrap policy roughly mirrors what DDP gets from 25 MB bucketing. The trade-off is the same: too small wastes bandwidth on per-call latency, too large eats the overlap window. See all-gather vs reduce-scatter for the FSDP collective shape.
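A hedged sketch of what tuning the wrap looks like in code. TransformerBlock stands in for your model's block class and model for the module being wrapped; the two policies shown are the stock ones in torch.distributed.fsdp.wrap.

import functools
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import (
    size_based_auto_wrap_policy,
    transformer_auto_wrap_policy,
)

# One all-gather / reduce-scatter unit per transformer block.
block_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={TransformerBlock},   # TransformerBlock: placeholder class
)

# Size-based grouping, closest in spirit to bucket_cap_mb (counted in
# parameters rather than megabytes).
size_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=int(25e6))

model = FSDP(model, auto_wrap_policy=block_policy)   # model: placeholder module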
What this means in practice
- The 25 MB default is fine for almost every model. If you have not profiled and seen the ring underused, do not touch it.
- Tune up (50 to 100 MB) only for very large transformers where profiling shows individual buckets are not saturating fabric bandwidth. Keep an eye on overlap: bigger buckets stall the all-reduce until later in backward.
- Tune down (or turn on find_unused_parameters=True) only when a model needs it. Both choices add per-step overhead: smaller buckets push you back toward the latency-bound regime, and the unused-parameter scan costs an extra autograd-graph traversal every iteration.
See also
- Compute-comm overlap
- All-gather vs reduce-scatter
Updated 2026-05-10