Computer makers are unveiling more than 50 servers with Nvidia's A100 graphics processing units (GPUs) to power AI, data science, and scientific computing applications. The first GPU based on the Nvidia Ampere architecture, the A100 delivers the company's largest generational leap in GPU performance to date, with features such as the ability to partition a single GPU into as many as seven separate GPU instances as needed, Nvidia said. The company made the announcement ahead of ISC High Performance, an online event dedicated to high-performance computing.
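That partitioning feature, known as Multi-Instance GPU (MIG), exposes each slice of the A100 as if it were a standalone device. As a rough, hedged sketch of how a workload might target one such slice, assuming an A100 with MIG mode already enabled by an administrator and the nvidia-ml-py (pynvml) package installed, the code could look something like this:

```python
# Rough sketch (assumes an A100 with MIG mode already enabled and the
# nvidia-ml-py / pynvml package installed): list the GPU's MIG instances
# and pin this process to the first one.
import os
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)          # the physical A100

current_mode, _pending = pynvml.nvmlDeviceGetMigMode(gpu)
if current_mode != pynvml.NVML_DEVICE_MIG_ENABLE:
    raise RuntimeError("MIG mode is not enabled on this GPU")

# Enumerate the configured MIG instances (up to seven on an A100).
mig_uuids = []
for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
    except pynvml.NVMLError:
        continue                                    # slot not configured
    uuid = pynvml.nvmlDeviceGetUUID(mig)
    mig_uuids.append(uuid if isinstance(uuid, str) else uuid.decode())

print(f"Found {len(mig_uuids)} MIG instances:", mig_uuids)

# Confine a workload to one slice: set CUDA_VISIBLE_DEVICES to that
# instance's UUID before any CUDA framework initializes in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = mig_uuids[0]

pynvml.nvmlShutdown()
```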
Unveiled in May, the A100 GPU has 54 billion transistors (the on-off switches that are the building blocks of all things electronic) and can deliver five petaflops of performance, about 20 times more than the previous-generation Volta chip. This means central processing unit (CPU) servers that cost $20 million and take up 22 racks can be replaced by GPU-based servers that cost $3 million and take up just four racks, said Paresh Kharya, director of product marketing for accelerated computing at Nvidia, in a press briefing.
The systems are coming from computer makers that include Asus, Atos, Cisco, Dell, Fujitsu, Gigabyte, Hewlett Packard Enterprise, Inspur, Lenovo, One Stop Systems, Quanta/QCT, and Supermicro. Server availability varies, with 30 systems expected this summer and over 20 more by the end of the year, Kharya said.
Integrating Mellanox
The new machines include InfiniBand interconnect technology from Mellanox, which Nvidia agreed to acquire for roughly $7 billion in 2019. Nvidia combined Mellanox networking with the A100 to build Selene, which Nvidia bills as a top 10 supercomputer and the world's most energy-efficient supercomputer. Selene was designed in less than a month and delivers more than 1 exaflop of AI performance. Kharya said supercomputers like Selene will help Nvidia push further into the ranks of the world's top supercomputers.
Last year, Nvidia GPUs were part of 125 of the world's top 500 supercomputers, according to ISC. Counting the supercomputers that use Mellanox InfiniBand technology, the number rises to more than 300, and the list is expected to grow even larger in 2020.
“If you look at the top 500 list, the reason why Nvidia is so successful in supercomputing is because scientific computing has changed,” Kharya said. “We’ve entered a new era, one that has expanded beyond traditional modeling and simulation workloads to include AI, data analytics, edge screening, and big data visualization.”
Kharya said Mellanox interconnect chips power the world’s leading weather forecast supercomputers. Weather and climate models are both compute- and data-intensive. Forecast quality depends on the model’s complexity and level of resolution. And supercomputer performance depends on interconnect technology to move data quickly across different computers.
“It’s exciting to have the best compute on one side and the best network on the other, and now we can start to combine those technologies together and start building amazing things,” said Gilad Shainer, senior vice president at Nvidia, in a press briefing.
Customers using Mellanox include the Spanish Meteorological Agency, the China Meteorological Administration, the Finnish Meteorological Institute, NASA, and the Royal Netherlands Meteorological Institute.
The Beijing Meteorological Service has selected 200 Gigabit HDR InfiniBand interconnect technology to accelerate its new supercomputing platform, which will be used to enhance weather forecasting, improve climate and environmental research, and serve the weather forecasting information needs of the 2022 Winter Olympics in Beijing.
Nvidia said it used its RAPIDS suite of open source data science software to complete a big data analytics benchmark in just 14.5 minutes, about 19.5 times faster than the previous record of 4.7 hours, set on a CPU-based system. Nvidia owes the gains to its new DGX A100 systems built on the A100 GPU: the 16 DGX A100 systems used in the benchmark run contained a total of 128 A100 GPUs linked by Mellanox interconnects.
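For context on what that software stack looks like to a developer: RAPIDS' core library, cuDF, mirrors the pandas dataframe API while executing on the GPU. The following is a minimal illustrative sketch of that pattern, not the benchmark workload itself, and the file and column names are made up:

```python
# Minimal RAPIDS sketch: the same filter/groupby/aggregate pattern pandas
# users write, executed on the GPU with cuDF. File and column names are
# hypothetical placeholders.
import cudf

orders = cudf.read_csv("orders.csv")                # parsed on the GPU

# Filter and aggregate entirely in GPU memory.
large_orders = orders[orders["amount"] > 100.0]
revenue_by_region = (
    large_orders.groupby("region")["amount"]
    .sum()
    .sort_values(ascending=False)
)

print(revenue_by_region.head())                     # copy a few rows back to the host
```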
Nvidia also unveiled the Nvidia Mellanox UFM Cyber-AI platform, which minimizes downtime in InfiniBand datacenters by harnessing AI-powered analytics to detect security threats and operational issues.
This extension of the UFM platform portfolio, which has managed InfiniBand systems for nearly a decade, applies AI to learn a datacenter's operational cadence and network workload patterns, drawing on both real-time and historical telemetry and workload data. Against this baseline, it tracks the system's health and network changes and detects performance problems.
The new platform provides alerts about abnormal system and application behavior, potential system failures, and threats, and it performs corrective actions. It delivers security alerts when it detects attempts to compromise systems, such as hijacking them for cryptocurrency mining. The result is reduced datacenter downtime, which typically costs more than $300,000 an hour, according to the ITIC 2020 report.
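Nvidia has not published the model behind Cyber-AI, but the general pattern it describes, learning a telemetry baseline and flagging sharp deviations from it, can be illustrated with a deliberately simple rolling z-score check. This is only a conceptual sketch with invented counter values, not the product's algorithm:

```python
# Conceptual sketch only: flag telemetry samples that deviate sharply from a
# learned baseline, the general pattern (not the actual algorithm) behind
# tools like UFM Cyber-AI. The sample data is invented.
from statistics import mean, stdev

def anomalies(samples, window=20, threshold=4.0):
    """Yield (index, value) pairs whose z-score vs. the trailing window is high."""
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and abs(samples[i] - mu) / sigma > threshold:
            yield i, samples[i]

# e.g. per-port link-error counters sampled once a minute (made-up numbers)
port_errors = [2, 3, 2, 4, 3, 2, 3, 3, 2, 4, 3, 2, 3, 4, 2, 3, 3, 2, 4, 3, 95]
for idx, value in anomalies(port_errors):
    print(f"sample {idx}: counter jumped to {value}, investigate this port")
```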
Fighting the coronavirus
Kharya said Nvidia's scientific computing platform has been enlisted in the fight against COVID-19. In genomics, Oxford Nanopore Technologies was able to sequence the virus genome in just seven hours using Nvidia GPUs. For infection analysis and prediction, the Nvidia RAPIDS team has helped create a GPU-accelerated version of Plotly's Dash, a data visualization tool that enables clearer insight into real-time infection rates.
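Dash itself is an open source Python framework, so a dashboard of the kind described here is ordinary application code; in the GPU-accelerated setup, the heavy dataframe work behind the charts runs on RAPIDS rather than pandas. A bare-bones, illustrative example with fabricated case counts:

```python
# Bare-bones Plotly Dash app sketch with fabricated infection counts; in the
# GPU-accelerated setup described above, the dataframe work would be done
# with RAPIDS/cuDF instead of plain Python lists.
from dash import Dash, dcc, html
import plotly.graph_objects as go

days = ["Mon", "Tue", "Wed", "Thu", "Fri"]
cases = [120, 135, 160, 150, 180]                   # made-up sample data

figure = go.Figure(go.Scatter(x=days, y=cases, mode="lines+markers"))
figure.update_layout(title="Reported cases (sample data)")

app = Dash(__name__)
app.layout = html.Div([
    html.H2("Infection rate dashboard (illustrative)"),
    dcc.Graph(figure=figure),
])

if __name__ == "__main__":
    app.run(debug=True)                             # serves on http://127.0.0.1:8050
```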
Nvidia's tools can also be used to predict the availability of hospital resources across the U.S. In structural biology, the U.S. National Institutes of Health and the University of Texas at Austin are using the GPU-accelerated CryoSPARC software to reconstruct the first 3D structure of the virus' spike protein with cryogenic electron microscopy.
In treatment, Nvidia worked with the National Institutes of Health to build AI models that accurately classify COVID-19 infections from lung scans, so doctors can devise effective treatment plans. In drug discovery, Oak Ridge National Laboratory ran the Scripps Research Institute's AutoDock on the GPU-accelerated Summit supercomputer to screen a billion potential drug combinations in just 12 hours.
In robotics, startup Kiwi is building robots that deliver medical supplies autonomously. And at the edge, Whiteboard Coordinator built an AI system that automatically detects elevated body temperatures, screening well over 2,000 health care workers per hour. In total, Nvidia accelerates more than 700 high-performance computing applications.