At a virtual event this morning, AMD CEO Lisa Su unveiled the company’s latest and much-anticipated server products: the new Milan-X CPU, which leverages AMD’s new 3D V-Cache technology; and its new Instinct MI200 GPU, which provides up to 220 compute units across two Infinity Fabric-connected dies, delivering an astounding 47.9 peak double-precision teraflops.
“We’re in a high-performance computing megacycle, driven by the growing need to deploy additional compute performance delivered more efficiently and at ever-larger scale to power the services and devices that define modern life,” said Su.
AMD’s new third-generation Epyc CPU with AMD 3D V-Cache, codenamed Milan-X, is the company’s first server CPU with 3D chiplet technology. The processors have three times the L3 cache of standard Milan processors. In Milan, each core complex die (CCD) had 32 megabytes of L3 cache; Milan-X adds 64 megabytes of 3D stacked cache on top for a total of 96 megabytes per CCD. With eight CCDs, that adds up to 768 megabytes of L3 cache. Adding in L2 and L1 cache, there is a total of 804 megabytes of cache per socket.
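The per-socket cache total can be checked with a few lines of arithmetic. The sketch below takes the L3 figures from the paragraph above; the per-core L2 and L1 sizes (512 KB of L2, and 32 KB each of L1 instruction and data cache) are assumptions based on the publicly documented Zen 3 core, not numbers stated in AMD’s presentation.

```python
# Back-of-the-envelope check of the Milan-X per-socket cache total.
CCDS_PER_SOCKET = 8
BASE_L3_MB_PER_CCD = 32      # standard Milan L3 per CCD (from the article)
STACKED_L3_MB_PER_CCD = 64   # 3D V-Cache added per CCD (from the article)
CORES_PER_SOCKET = 64        # max core count (from the article)

# Assumptions, not stated in the article: Zen 3 per-core cache sizes.
L2_KB_PER_CORE = 512
L1_KB_PER_CORE = 32 + 32     # L1 instruction + L1 data


def milan_x_cache_mb():
    """Return the L3, L2, L1 and total cache per socket in megabytes."""
    l3 = CCDS_PER_SOCKET * (BASE_L3_MB_PER_CCD + STACKED_L3_MB_PER_CCD)
    l2 = CORES_PER_SOCKET * L2_KB_PER_CORE // 1024
    l1 = CORES_PER_SOCKET * L1_KB_PER_CORE // 1024
    return {"l3": l3, "l2": l2, "l1": l1, "total": l3 + l2 + l1}


print(milan_x_cache_mb())
```

Under those assumptions, L2 contributes 32 megabytes and L1 contributes 4 megabytes, which together with the 768 megabytes of L3 matches the 804-megabyte figure.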
Milan-X is built on the same Zen 3 cores as Milan, and will also have a max core count of 64 total cores. The enhanced processors are compatible with existing platforms after a BIOS upgrade.
Milan-X with 3D V-Cache employs a hybrid bonding plus through-silicon via (TSV) approach, providing more than 200 times the interconnect density of 2D chiplets and more than 15 times the density of existing 3D stacking solutions, according to AMD. The die-to-die interface uses a direct copper-to-copper bond with no solder bumps to improve thermals, transistor density and interconnect pitch.
AMD is reporting a 50 percent performance improvement for Milan-X on targeted technical computing workloads compared to Milan processors. The chipmaker demonstrated Milan-X’s performance speedup on an EDA workload, running Synopsys’ verification solution VCS. A 16-core Milan-X with AMD’s 3D V-Cache delivered 66 percent faster RTL verification compared to the standard Milan without V-Cache. VCS is used by many of the world’s top semiconductor companies to catch defects early in the development process before a chip is committed to silicon.
Microsoft Azure is the first announced customer for Milan-X, with upgraded HBv3 instances in preview today, and a planned refresh on the way for its entire HBv3 deployment. Traditional OEM and ODM server partners Dell Technologies, HPE, Lenovo, and Supermicro are preparing Milan-X products for the first quarter of 2022. Named ISV ecosystem partners include Altair, Ansys, Cadence, Siemens and Synopsys.
Manufactured on TSMC’s 6nm process, the MI200 is the world’s first multichip GPU, designed to maximize compute and data throughput in a single package. The MI200 series contains two CDNA 2 GPU dies harnessing 58 billion transistors. It features up to 220 compute units and 880 second-generation matrix cores. Eight stacks of HBM2e memory provide a total of 128 gigabytes of memory at 3.2 TB/s, four times the capacity and 2.7 times the bandwidth of the MI100. Connecting the two CDNA 2 dies are Infinity Fabric links running at 25 Gbps for a total of 400 GB/s of bidirectional bandwidth.
The MI200 accelerator – with up to 47.9 peak double-precision teraflops – ostensibly answers the question, what if a chip designer dramatically optimized the GPU architecture for double-precision (FP64) performance? The MI250X ramps up peak double-precision 4.2 times over the MI100 in one year (47.9 teraflops versus 11.5 teraflops). By comparison, AMD pointed out that Nvidia grew its traditional double-precision FP64 peak performance for its server GPUs 3.7 times from 2014 until 2020. In a side by side comparison, the MI200 OAM is nearly five times faster than Nvidia’s A100 GPU in peak FP64 performance, and 2.5 times faster in peak FP32 performance.
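The ratios quoted above follow from simple division. In the sketch below, the MI250X and MI100 peaks come from the article; the A100 peaks (9.7 teraflops FP64 and 19.5 teraflops FP32, non-tensor) are assumptions taken from Nvidia’s published specifications rather than from AMD’s presentation.

```python
# Peak-throughput ratios behind the comparisons in the text (teraflops).
MI250X_FP64 = 47.9   # from the article
MI250X_FP32 = 47.9   # from the article
MI100_FP64 = 11.5    # from the article

# Assumptions, not stated in the article: Nvidia A100 non-tensor peaks.
A100_FP64 = 9.7
A100_FP32 = 19.5


def speedup(new_tflops, old_tflops):
    """Ratio of two peak-throughput figures."""
    return new_tflops / old_tflops


print(f"MI250X vs MI100, FP64: {speedup(MI250X_FP64, MI100_FP64):.1f}x")
print(f"MI250X vs A100,  FP64: {speedup(MI250X_FP64, A100_FP64):.1f}x")
print(f"MI250X vs A100,  FP32: {speedup(MI250X_FP32, A100_FP32):.1f}x")
```

These are ratios of theoretical peaks only; delivered application performance depends on the workload, as the benchmark comparisons elsewhere in AMD’s presentation suggest.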
Further, the Instinct MI250X delivers 47.9 teraflops of peak single-precision (FP32) performance and provides 383 teraflops of peak theoretical half-precision (FP16) performance for AI workloads. That dense computational capability doesn’t come without a power cost. The top-of-stack part, the OAM MI250X, consumes up to 560 watts, while air-cooled and other configurations will require somewhat less power. However, remember you’re essentially getting two GPUs in one package with that 500-560 watt TDP, and based on some of the disclosed system specs (like Frontier), the flops-per-watt targets are impressive.
During this morning’s launch event, Forrest Norrod, senior vice president and general manager of the datacenter and embedded solutions business group at AMD, showed head-to-head comparisons for the MI200 OAM versus Nvidia’s A100 (80GB) GPU on a range of HPC applications. In AMD testing, a single-socket 3rd gen AMD Epyc server with one AMD Instinct MI250X OAM 560 watt GPU achieved a median score of 42.26 teraflops on the High Performance Linpack benchmark.
Norrod also showed a competitive comparison of the MI200 OAM versus the Nvidia A100 (80GB) on the molecular simulation code LAMMPS, running a combustion simulation of a hydrocarbon molecule. In the timelapse of the simulation, four MI250X 560 watt GPUs can be seen completing the job in less than half the time of four A100 SXM 80GB 400 watt GPUs.
The MI200 accelerators introduce the third-generation AMD Infinity Fabric technology. Up to eight Infinity Fabric links connect the AMD Instinct MI200 with 3rd generation Epyc Milan CPUs and other GPUs in the node to enable unified CPU/GPU memory coherency.
AMD is also introducing its Elevated Fanout Bridge (EFB) technology. “Unlike substrate embedded silicon bridge architectures, EFB enables use of standard substrates and assembly techniques, providing better precision, scalability and yields while maintaining high performance,” said Norrod.
Three models were announced for the new MI200 series: the MI250X and MI250, available in an open-hardware compute accelerator module or OCP Accelerator Module (OAM) form factor; and the AMD Instinct MI210, a PCIe card form factor that will be forthcoming in OEM servers.
The AMD MI250X accelerator is currently available from HPE in the HPE Cray EX Supercomputer. Other MI200 series accelerators, including the PCIe form factor, are expected in Q1 2022 from server partners, including ASUS, ATOS, Dell Technologies, Gigabyte, HPE, Lenovo and Supermicro.
The MI250X accelerator will be the primary computational engine of the upcoming exascale supercomputer Frontier, currently being installed at the DOE’s Oak Ridge National Laboratory. Each of Frontier’s 9,000+ nodes will include one “optimized 3rd Gen AMD Epyc CPU” – not Milan-X – linked to four AMD MI250X accelerators over AMD’s coherent Infinity Fabric. With a promised performance target of >1.5 peak double-precision exaflops, Frontier could achieve greater than 1.72 exaflops peak just owing to its GPUs (9,000 x 4 x 47.9 teraflops).
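As a rough check on that GPU-only estimate, the sketch below multiplies the node count, the accelerators per node, and the 47.9-teraflop peak double-precision figure quoted for the MI250X. The node count of exactly 9,000 is a lower bound taken from the “9,000+” in the text, so the true figure would be somewhat higher.

```python
# GPU-only peak FP64 estimate for Frontier, using figures from the article.
NODES = 9_000                 # lower bound; the article says "9,000+"
GPUS_PER_NODE = 4
MI250X_FP64_TFLOPS = 47.9     # peak double-precision per accelerator

TFLOPS_PER_EXAFLOP = 1_000_000


def frontier_gpu_peak_exaflops(nodes=NODES):
    """Aggregate GPU peak FP64 in exaflops for a given node count."""
    total_tflops = nodes * GPUS_PER_NODE * MI250X_FP64_TFLOPS
    return total_tflops / TFLOPS_PER_EXAFLOP


print(f"{frontier_gpu_peak_exaflops():.2f} exaflops peak from GPUs alone")
```

This counts theoretical peak only and ignores the CPUs; sustained results on benchmarks such as High Performance Linpack would be lower.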
As we detailed recently, the MI200 will be powering three giant systems on three continents. In addition to Frontier, expected to be the United States’ first exascale computer coming online next year, the MI200 was selected for the European Union’s pre-exascale LUMI system and Australia’s petascale Setonix system.
“The adoption of Milan has significantly outpaced Rome as our momentum grows,” said Su. Looking ahead on the roadmap, the next-gen “Genoa” Epyc platform will have up to 96 high-performance 5nm “Zen 4” cores, and will support next-generation memory and IO capabilities: DDR5, PCIe Gen 5 and CXL. Genoa is now sampling to customers with production and launch anticipated next year, AMD said.
“We’ve worked with TSMC to optimize 5nm for high performance computing,” said Su. “[The new process] offers twice the density, twice the power efficiency and 1.25x the performance of the 7nm process we’re using in today’s products.”
Su also unveiled a new version of Zen 4 for cloud native computing, called “Bergamo.” Bergamo features up to 128 high performance “Zen 4 C” cores, and will come with the other features of Genoa: DDR5, PCIe Gen 5, CXL 1.1, and the full suite of Infinity Guard security features. Further, it is socket compatible with Genoa with the same Zen 4 instruction set. Bergamo is on track to start shipping in the first half of 2023, Su said.
“Our investment in multi-generational CPU core roadmaps combined with advanced process and packaging technology enables us to deliver leadership across general purpose technical computing and cloud workloads,” said Su. “You can count on us to continue to push the envelope in high-performance computing.”
AMD also announced version 5.0 of ROCm, its open software platform that supports environments across multiple accelerator vendors and architectures. “With ROCm 5.0, we’re adding support and optimization for the MI200, expanding ROCm support to include the Radeon Pro W6800 workstation GPUs, and improving developer tools that increase end user productivity,” said AMD’s Corporate Vice President, GPU Platforms, Brad McCredie in a media briefing last week.
The company also has a new Infinity Hub, where developers can access documentation, tools and education materials for HIP and OpenMP, and system administrators and scientists can download containerized HPC apps and ML frameworks that are optimized and supported on AMD platforms.
Commenting on today’s news raft, market watcher Addison Snell, CEO of Intersect360 Research, said, “AMD has set the new bar for performance in HPC – in CPU, in GPU, and in packaging both together. Either Milan-X or MI200 makes a statement on its own – multiple statements, based on the benchmarks. Having coherent memory over Infinity Fabric is a game-changer that neither Intel nor Nvidia is going to be able to match soon.”