This morning, everybody found out what CEO Jensen Huang was cooking—an Ampere-powered successor to the Volta-powered DGX-2 deep learning system.
Yesterday, we described the mysterious hardware in Huang’s kitchen as likely “packing a few Xeon CPUs” in addition to the new successor to the Tesla V100 GPU. Egg’s on our face for that one—the new system packs a pair of AMD Epyc 7742 64-core, 128-thread CPUs, along with 1TiB of RAM, a pair of 1.9TiB NVMe SSDs in RAID 1 for a boot drive, and up to four 3.8TiB PCIe 4.0 NVMe drives in RAID 0 as secondary storage.
Goodbye Intel, hello AMD
Technically, it shouldn’t come as too much of a surprise that Nvidia would tap AMD for the CPUs in its flagship machine-learning nodes—Epyc Rome has been kicking Intel’s Xeon server CPU line up and down the block for quite a while now. Staying on the technical side of things, Epyc 7742’s support for PCIe 4.0 may have been even more important than its high CPU speed and massive core/thread count.
GPU-based machine learning frequently bottlenecks on storage, not CPU. The M.2 and U.2 interfaces used by the DGX A100 each use four PCIe lanes, so the move from PCI Express 3.0 to PCI Express 4.0 doubles the available storage bandwidth, from roughly 4GB/sec to roughly 8GB/sec per individual SSD.
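Those per-drive figures are easy to sanity-check with a back-of-the-envelope calculation. The sketch below assumes 128b/130b line coding for both generations and ignores protocol overhead, so real-world throughput lands a bit lower.

```python
# Rough PCIe bandwidth estimate for a 4-lane link (M.2 and U.2 are both x4).
# Assumes 128b/130b encoding and no protocol overhead -- an approximation.

def pcie_x4_gbytes_per_sec(transfer_rate_gt_per_s: float) -> float:
    """Approximate usable bandwidth of a PCIe x4 link in GB/s."""
    encoding_efficiency = 128 / 130   # 128b/130b line coding
    lanes = 4
    return transfer_rate_gt_per_s * encoding_efficiency / 8 * lanes

print(f"PCIe 3.0 x4: {pcie_x4_gbytes_per_sec(8.0):.1f} GB/s")    # ~3.9 GB/s
print(f"PCIe 4.0 x4: {pcie_x4_gbytes_per_sec(16.0):.1f} GB/s")   # ~7.9 GB/s
```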
There may have been a little bit of politics lurking behind the decision to change CPU vendors, as well. AMD might be Nvidia’s biggest competitor in the relatively low-margin consumer-graphics market, but Intel is muscling in on the data center side of the market. For now, Intel’s offerings in discrete GPUs are mostly vapor—but we know Chipzilla’s got much bigger and grander plans as it shifts its focus from the moribund consumer-CPU market to all things data center.
The Intel DG1 itself—the only real Xe hardware we’ve seen so far—has shown up in leaked benchmarks roughly on par with the integrated Vega GPU in a Ryzen 7 4800U. But Nvidia might be more concerned about the Xe HP 4-tile GPU, whose 2048 EUs (execution units) might offer up to 36TFLOPS—which would at least be in the same ballpark as the Nvidia A100 GPU powering the DGX unveiled today.
DGX, HGX, SuperPOD, and Jetson
The DGX A100 was the star of today’s announcements—it’s a self-contained system featuring eight A100 GPUs with 40GiB of GPU memory apiece. The US Department of Energy’s Argonne National Lab is already using one DGX A100 for COVID-19 research. The system’s nine 200Gbps Mellanox interconnects make it possible to cluster multiple DGX A100s—but those whose budget won’t support lots of $200,000 GPU nodes can make do by partitioning the eight A100 GPUs in a single system into as many as 56 instances, seven per GPU.
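That 56-instance figure falls straight out of the A100’s Multi-Instance GPU (MIG) limit of seven slices per GPU; the sketch below just works through the arithmetic, and the per-slice memory number is a rough estimate rather than a quoted spec.

```python
# Where the 56-instance figure comes from: eight A100s, each of which
# can be split into at most seven MIG (Multi-Instance GPU) slices.
# The per-slice memory figure is an approximation, not a quoted spec.

GPUS_PER_DGX_A100 = 8
MAX_MIG_SLICES_PER_GPU = 7
GPU_MEMORY_GIB = 40

total_instances = GPUS_PER_DGX_A100 * MAX_MIG_SLICES_PER_GPU
approx_mem_per_slice = GPU_MEMORY_GIB / MAX_MIG_SLICES_PER_GPU

print(f"{total_instances} isolated GPU instances per DGX A100")     # 56
print(f"~{approx_mem_per_slice:.1f} GiB of GPU memory per slice")   # ~5.7
```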
For those who do have the budget to buy and cluster oodles of DGX A100 nodes, the systems are also available in HGX (hyperscale data center accelerator) form. Nvidia says that a “typical cloud cluster” comprising its earlier DGX-1 nodes for training plus 600 CPU systems for inference could be replaced by five DGX A100 units capable of handling both workloads. That would condense the hardware down from 25 racks to one, the power budget from 630kW to 28kW, and the cost from $11 million to $1 million.
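Taken at face value, Nvidia’s consolidation claim works out to double-digit reductions across the board; the quick sketch below simply restates the company’s own numbers as ratios rather than adding any new measurements.

```python
# Nvidia's consolidation claim restated as simple ratios.
# The "legacy" figures are Nvidia's quoted numbers, not measurements.

legacy_cluster = {"racks": 25, "power_kw": 630, "cost_musd": 11}
dgx_a100_row = {"racks": 1, "power_kw": 28, "cost_musd": 1}

for metric, old_value in legacy_cluster.items():
    ratio = old_value / dgx_a100_row[metric]
    print(f"{metric}: {ratio:.0f}x reduction")   # 25x, 22x, 11x
```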
If the HGX still doesn’t sound big enough, Nvidia has also released a reference architecture for its SuperPOD—no relation to Plume. Nvidia’s A100 SuperPOD connects 140 DGX A100 nodes and 4PB of flash storage over 170 InfiniBand switches, and it offers 700 petaflops of AI performance. Nvidia has added four of the SuperPODs to its own SaturnV supercomputer, which—according to Nvidia, at least—makes SaturnV the fastest AI supercomputer in the world.
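That headline number lines up with Nvidia’s per-system marketing figure; a one-line check, assuming the quoted 5 “AI petaflops” per DGX A100, is below.

```python
# Sanity check: 700 "AI petaflops" spread across 140 DGX A100 nodes
# matches Nvidia's quoted 5 petaflops per system.

superpod_pflops = 700
dgx_a100_nodes = 140
print(superpod_pflops / dgx_a100_nodes, "petaflops per node")   # 5.0
```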
Finally, if the data center’s not your thing, you can have an A100 in your edge computing instead, with the Jetson EGX A100. For those not familiar, Nvidia’s Jetson single-board computers can be thought of as Raspberry Pis on steroids—deployable in IoT scenarios, but bringing significant processing power to a small form factor that can be ruggedized and embedded for edge applications such as robotics, health care, and drones.