Building a GPU Data Center from Scratch: Lessons Learned
2025-09-15
When we broke ground on the first NewMachine facility in Secaucus in 2018, we thought we knew what we were doing. Viktor had designed infrastructure for some of the most demanding HPC organizations. Our architects had built data centers for major cloud providers. And yet, the GPU-specific requirements created challenges that no amount of general data-center experience fully prepared us for.
Lesson one: power density matters more than total power. GPU training nodes, especially 8xH100 DGX-class systems, draw far more power per rack unit than typical cloud workloads: a single DGX H100 pulls roughly 10 kW at full load, so even four per rack overwhelms a conventional rack budget. Our first facility was designed for 20 kW per rack. Within a year, clients were requesting 60 kW. We retrofitted with direct liquid cooling, but the lesson was clear: design for the most demanding GPU density you can imagine, then double it.
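To make the arithmetic concrete, here is a minimal capacity-planning sketch. The per-node figure comes from the published DGX H100 spec (about 10.2 kW maximum); the 10% overhead factor for fans and in-rack gear is an illustrative assumption, not a measured NewMachine number.

```python
# Minimal rack power-budget sketch. Per-node draw is the published
# DGX H100 max (~10.2 kW); the overhead factor is an assumption.
DGX_H100_KW = 10.2       # assumed max draw per 8xH100 node
RACK_OVERHEAD = 0.10     # assumed 10% for fans, switches, in-rack gear

def rack_power_kw(nodes_per_rack: int) -> float:
    """Estimated total rack draw, including non-GPU overhead."""
    return nodes_per_rack * DGX_H100_KW * (1 + RACK_OVERHEAD)

for nodes in (1, 2, 4, 6):
    print(f"{nodes} nodes/rack -> {rack_power_kw(nodes):.1f} kW")
# 4 nodes/rack -> ~44.9 kW: already far past a 20 kW design point,
# and 6 nodes/rack (~67.3 kW) exceeds the 60 kW clients were requesting.
```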
Lesson two: cooling is a first-class engineering problem. Air cooling alone cannot handle modern GPU clusters. In our Chicago facility, we deployed rear-door heat exchangers and found they could not keep pace during sustained multi-day training runs. We solved it with direct-to-chip liquid cooling backed by a dedicated coolant distribution unit per row. Cooling overhead dropped by 40%, and GPU throttling incidents dropped to zero.
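For a sense of what a per-row coolant distribution unit has to deliver, here is a back-of-the-envelope flow calculation from the standard heat-transfer relation Q = ṁ·c_p·ΔT. The 60 kW load and 10 K supply-to-return rise are illustrative assumptions, not our facility's actual operating points.

```python
# Back-of-the-envelope CDU sizing via Q = m_dot * c_p * delta_T.
# Loads and temperature rise below are illustrative assumptions.
WATER_CP = 4186.0   # J/(kg*K), specific heat of water
WATER_RHO = 1000.0  # kg/m^3, density of water

def coolant_flow_lpm(heat_load_kw: float, delta_t_k: float) -> float:
    """Coolant flow (L/min) needed to absorb heat_load_kw at a delta_t_k rise."""
    mass_flow = heat_load_kw * 1000.0 / (WATER_CP * delta_t_k)  # kg/s
    return mass_flow / WATER_RHO * 1000.0 * 60.0                # L/min

# A 60 kW rack with a 10 K supply-to-return rise needs roughly 86 L/min:
print(f"{coolant_flow_lpm(60, 10):.0f} L/min")
```

Multiply that by the racks on a row and the per-row CDU sizing (and the pipe diameters feeding it) follows directly, which is why retrofitting liquid cooling into an air-cooled hall is so painful.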
Lesson three: the NOC is the product. Clients do not interact with our InfiniBand fabric or our cooling systems — they interact with the people who answer the phone when a GPU node drops out of a training run at 3 AM. Our NOC engineers are trained on ML training workflows, distributed checkpoint recovery, and NCCL debugging. They know that a node failure during hour 47 of a 72-hour training run is not the same as a routine hardware ticket, and they triage accordingly.
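A hypothetical sketch of that triage logic follows; the field names and thresholds are invented for illustration, not our actual runbook. The idea is simply that a failure inside an active run, with meaningful uncheckpointed work at risk, pages a human immediately rather than entering a ticket queue.

```python
# Hypothetical NOC triage heuristic; field names and thresholds are
# illustrative, not NewMachine's actual escalation policy.
from dataclasses import dataclass

@dataclass
class NodeAlert:
    node_id: str
    in_active_run: bool            # node is part of a running training job
    run_hours_elapsed: float       # how long the job has been running
    hours_since_checkpoint: float  # work at risk if the job dies

def severity(alert: NodeAlert) -> str:
    """Escalate failures that threaten long runs or stale checkpoints."""
    if not alert.in_active_run:
        return "routine"   # standard hardware ticket
    if alert.run_hours_elapsed > 24 or alert.hours_since_checkpoint > 1:
        return "page-now"  # wake an engineer: real training time at risk
    return "urgent"

# Hour 47 of a long run with 2.5 hours of uncheckpointed work: page now.
print(severity(NodeAlert("gpu-047", True, 47.0, 2.5)))  # -> page-now
```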