Direct Liquid Cooling
Air can carry away roughly 30 kW per standard 42U rack before the math stops working. An H100 SXM5 dissipates 700 W. Eight of them in an HGX node at maximum load is 5.6 kW just from the GPUs, before CPUs, NICs, NVMe, or fans. Stack four such nodes and the GPUs alone demand 22.4 kW; add the rest of each node and the rack sails past what the room's air can deliver. Direct liquid cooling is the way out: take heat off the die through a cold plate and push it into a closed water loop that exits the rack.
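The arithmetic is worth writing down. A minimal sketch; the per-node overhead figure is an assumption, everything else is from the numbers above:

```python
# Rack power budget from the figures above. NODE_OVERHEAD_W is an
# assumption for CPUs, NICs, NVMe, and fans; the rest is from the text.
GPU_TDP_W = 700          # H100 SXM5
GPUS_PER_NODE = 8
NODE_OVERHEAD_W = 2_400  # assumed non-GPU draw per HGX node
AIR_CEILING_W = 30_000   # rough air-cooling limit for a 42U rack

node_w = GPU_TDP_W * GPUS_PER_NODE + NODE_OVERHEAD_W  # 8,000 W per node
for nodes in (1, 2, 3, 4):
    rack_w = nodes * node_w
    status = "over" if rack_w > AIR_CEILING_W else "under"
    print(f"{nodes} nodes: {rack_w/1000:.1f} kW ({status} the air ceiling)")
```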
The two-loop pattern
A modern DLC deployment runs two coolant loops, separated by a coolant distribution unit (CDU). The secondary loop is the technology side: it touches the cold plates over the GPUs, picks up heat at roughly 25 °C, and returns at roughly 45 °C. The primary loop is the facility side: chilled or condenser water from the building, often shared across multiple racks. The CDU is a plate-and-frame heat exchanger that lets the loops swap heat without mixing fluids, plus a pump and a filter.
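Sizing either loop is one energy balance: the same heat crosses the CDU's plates, so each loop's flow follows from its own temperature rise. A sketch, assuming an NVL72-class 120 kW rack load and a 10 K primary-side rise:

```python
# Energy balance across the CDU. The 25->45 C secondary rise is from
# the text; the 120 kW load and 10 K primary rise are assumptions.
CP_WATER = 4186          # J/(kg*K)
RACK_LOAD_W = 120_000    # assumption: one NVL72-class rack

secondary_dt = 45 - 25   # K, cold-plate loop
primary_dt = 10          # K, assumed facility-water rise

sec_kg_s = RACK_LOAD_W / (CP_WATER * secondary_dt)   # ~1.4 kg/s
pri_kg_s = RACK_LOAD_W / (CP_WATER * primary_dt)     # ~2.9 kg/s
print(f"secondary: {sec_kg_s*60:.0f} L/min, primary: {pri_kg_s*60:.0f} L/min")
```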
The two-loop separation is not a luxury. The secondary loop is closed and treated; the primary loop is whatever the building gives you, with whatever minerals, microorganisms, and pH it happens to have. Putting facility water directly into a manifold next to a GPU package is a recipe for galvanic corrosion, scale, and biological growth in cold-plate channels measured in millimeters. The CDU is the gasket between two reliability domains.
Why the cooling envelope sets the rack ceiling
Air cooling fails past 30 kW for a simple reason: the air is the heat-transfer fluid, and the room can only deliver so much air at so much ΔT through a rack of fixed dimensions. CFM (cubic feet per minute) and ΔT (the temperature rise across the rack) bound the watts you can carry. Past about 30 kW the curves stop intersecting at any combination of fan speed, supply temperature, and floor design that a real datacenter can build.
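The bound is a one-liner: Q = ρ · V̇ · cp · ΔT, solved for airflow. A sketch, assuming a 15 K rise across the rack:

```python
# How many CFM a given rack load needs at a given air delta-T.
# Q = rho * V * cp * dT, rearranged for volumetric flow.
RHO_AIR = 1.2        # kg/m^3
CP_AIR = 1005        # J/(kg*K)
M3S_TO_CFM = 2118.9

def required_cfm(load_w: float, delta_t_k: float) -> float:
    m3_s = load_w / (RHO_AIR * CP_AIR * delta_t_k)
    return m3_s * M3S_TO_CFM

for kw in (10, 30, 120):
    print(f"{kw} kW at 15 K rise: {required_cfm(kw * 1000, 15):,.0f} CFM")
# 30 kW already needs ~3,500 CFM through a single rack; 120 kW needs ~14,000.
```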
DLC dodges the constraint because water carries roughly 3,500× more heat per liter at the same temperature rise and is roughly 25× as thermally conductive as air. A 1 cm² channel of water moving at modest speed can carry away the heat of a 700 W die while barely warming. The same heat in air requires a fan curve and a floor design that do not fit inside a rack envelope.
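The same balance on the water side confirms 'barely warming'. A sketch, taking 1 m/s as the modest speed:

```python
# Temperature rise of water in a 1 cm^2 channel absorbing a 700 W die.
# The 1 m/s velocity is an assumed "modest speed".
RHO_WATER = 1000   # kg/m^3
CP_WATER = 4186    # J/(kg*K)

area_m2 = 1e-4     # 1 cm^2 channel cross-section
velocity = 1.0     # m/s, assumption
mass_flow = RHO_WATER * area_m2 * velocity   # 0.1 kg/s
delta_t = 700 / (mass_flow * CP_WATER)       # ~1.7 K
print(f"water rise across the die: {delta_t:.1f} K")
```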
This is why NVL72 exists in the form it does: NVIDIA's reference design reaches ~120 kW per rack because it assumes DLC and builds the chassis around manifold ports, not air filters. Air-cooled NVL72 is not a thing and never will be. See DLC and Watts per Rack for the density numbers that fall out.
What changes operationally
Leaks become a class of failure. Air-cooled racks fail open: a fan stops, the GPU thermal-throttles, you get warned, you replace the fan. DLC racks fail wet: a fitting loosens, a hose gets nicked, a quick-disconnect drips during a maintenance event. Real DLC deployments include drip trays, leak detection on the bottom of every rack, and quick-disconnect couplers that seal both sides on disconnect. Reliability engineering is no longer just about parts; it is about pipework.
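What first-class leak handling looks like: a wet signal pages a human and isolates the rack in the same breath. A minimal sketch; the pager and valve hooks are hypothetical stand-ins for whatever the site's BMS or DCIM actually exposes:

```python
# Minimal leak handling: a wet drip tray pages a human and isolates
# the rack at once. page_oncall and close_rack_valves are hypothetical
# stand-ins for real BMS/DCIM integrations.
def page_oncall(msg: str) -> None:
    print("PAGE:", msg)              # stand-in for a real pager hook

def close_rack_valves(rack_id: str) -> None:
    print("ISOLATE:", rack_id)       # stand-in for actuated valve control

def handle_leak(rack_id: str, tray_wet: bool) -> None:
    if tray_wet:
        page_oncall(f"leak detected under {rack_id}")
        close_rack_valves(rack_id)   # a leak is a floor problem in 30 minutes

handle_leak("rack-07", tray_wet=True)
```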
The CDU becomes a single point of failure unless it is redundant. A CDU pump that quits takes a whole zone of racks offline almost immediately: without flow, die temperatures climb within seconds, the GPUs throttle, and the throttling does not clear until coolant flow returns. Production DLC sites run CDUs in N+1 or 2N configurations and instrument supply pressure and return temperature on every rack.
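A sketch of the per-rack check; both thresholds are illustrative assumptions, not vendor numbers:

```python
# Per-rack coolant telemetry check: supply pressure and return
# temperature against envelope limits. Both thresholds are assumptions.
SUPPLY_PRESSURE_MIN_KPA = 150   # assumed floor; below it, suspect pump loss
RETURN_TEMP_MAX_C = 50          # assumed ceiling above the ~45 C design return

def coolant_alarms(supply_kpa: float, return_c: float) -> list[str]:
    alarms = []
    if supply_kpa < SUPPLY_PRESSURE_MIN_KPA:
        alarms.append("low supply pressure: check CDU pumps / N+1 failover")
    if return_c > RETURN_TEMP_MAX_C:
        alarms.append("high return temp: load exceeds heat rejection")
    return alarms

print(coolant_alarms(supply_kpa=120, return_c=47))
# -> ['low supply pressure: check CDU pumps / N+1 failover']
```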
Chemistry maintenance enters the runbook. Treated secondary water must be tested periodically for pH, conductivity, and biological load. Glycol percentage matters in cold climates. Filters clog. None of these tasks existed in an air-cooled facility, and the operators who ran the previous generation of fleets are learning them from cooling vendors and from each other.
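The chemistry runbook reduces to sampling against acceptance ranges. A sketch; the limits here are illustrative, and real ones come from the coolant vendor's spec:

```python
# Acceptance check for a secondary-loop coolant sample. The ranges are
# illustrative assumptions; real limits come from the coolant vendor.
LIMITS = {
    "ph":           (8.0, 10.0),    # assumed window for treated water
    "conductivity": (0.0, 500.0),   # uS/cm, assumed
    "glycol_pct":   (20.0, 35.0),   # assumed cold-climate mix
}

def sample_failures(sample: dict[str, float]) -> list[str]:
    return [
        f"{key}={sample[key]} outside [{lo}, {hi}]"
        for key, (lo, hi) in LIMITS.items()
        if not lo <= sample[key] <= hi
    ]

print(sample_failures({"ph": 8.6, "conductivity": 610.0, "glycol_pct": 25.0}))
# -> ['conductivity=610.0 outside [0.0, 500.0]']
```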
Practical guidance
- Treat the CDU as production-critical infrastructure. N+1 minimum on the pump side. Instrument supply temperature, return temperature, and flow on every rack.
- Run leak detection on every rack as a first-class signal. A leak that goes 30 minutes without alerting will damage the floor, not just one node.
- Match secondary loop temperature to the GPU spec. Most modern GPU cold plates expect 25-32 °C supply; cooler is fine, hotter eats your headroom for thermal stragglers (a guardrail sketch follows this list).
- Keep a manual valve between every rack and the manifold. If you cannot isolate a single rack without taking the row down, recovery from any single-rack event becomes a cluster event.
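The guardrail promised in the supply-temperature bullet, as a sketch; the warning margin is an assumption:

```python
# Guardrail for cold-plate supply temperature. The 25-32 C window is
# from the bullet above; the 2 C warning margin is an assumption.
SUPPLY_MAX_C = 32.0

def check_supply(rack_id: str, supply_c: float) -> None:
    headroom = SUPPLY_MAX_C - supply_c
    if headroom < 0:
        print(f"{rack_id}: supply {supply_c} C over the window, expect throttling")
    elif headroom < 2.0:
        print(f"{rack_id}: only {headroom:.1f} C of straggler headroom left")

check_supply("rack-07", 31.0)   # -> only 1.0 C of straggler headroom left
```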
DLC is not optional at modern density. Once the decision is made to deploy NVL72-class racks, every operational practice that follows assumes water inside the rack. The air-cooled era is finished for training infrastructure.
Updated 2026-05-09