Baidu Unveils Kunlun II AI Chip: Rival for Nvidia A100

Baidu spun off its semiconductor design business into an independent company, Kunlun Chip Technology Co., back in June, reports Data Centre Dynamics. The spinoff values the business at around $2 billion.

Kunlun Chip, a wholly owned subsidiary of Chinese high-tech giant Baidu, said this week that it had started volume production of its Kunlun II processor for AI applications. The new AI chip is based on the 2nd generation XPU microarchitecture, is made using a 7 nm process technology, and promises to offer two to three times higher performance than its predecessor.

Designed for cloud, edge, and autonomous vehicle applications, the 1st generation Kunlun K200 processor announced three years ago offers around 256 TOPS of INT8 performance, around 64 TOPS of INT/FP16 performance, and 16 TOPS of INT/FP32 performance at 150 W.

If the claim of roughly two to three times higher performance over its predecessor holds, the Kunlun II should deliver 512 to 768 INT8 TOPS, 128 to 192 INT/FP16 TOPS, and 32 to 48 INT/FP32 TOPS. For comparison, Nvidia's A100 offers 19.5 FP32 TFLOPS and 624/1248 (with sparsity) INT8 TOPS, so on paper the Kunlun II could compete with processors like the A100 as far as AI computations are concerned.
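To make the projection explicit, the short Python sketch below simply scales the published 1st-generation figures by the claimed 2x to 3x factor; the resulting ranges are estimates derived from Baidu's claim, not measured numbers.

```python
# Minimal sketch: project Kunlun II throughput from the 1st-gen K200 figures
# and Baidu's claimed 2x-3x generational speedup. Baseline numbers are the
# specs quoted above; the scaling factors are Baidu's claim, not measurements.

KUNLUN_1_TOPS = {"INT8": 256, "INT/FP16": 64, "INT/FP32": 16}
SPEEDUP_RANGE = (2, 3)  # claimed two to three times higher performance

for precision, baseline in KUNLUN_1_TOPS.items():
    low, high = (baseline * factor for factor in SPEEDUP_RANGE)
    print(f"Kunlun II {precision}: {low} - {high} TOPS (projected)")
```

Running the sketch reproduces the ranges quoted above: 512 to 768 INT8 TOPS, 128 to 192 INT/FP16 TOPS, and 32 to 48 INT/FP32 TOPS.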

Baidu Kunlun II’s Relative Performance

| Precision | Baidu Kunlun | Baidu Kunlun II | Nvidia A100 |
|---|---|---|---|
| INT8 | 256 TOPS | 512 ~ 768 TOPS | 624/1248* TOPS |
| INT/FP16 | 64 TOPS | 128 ~ 192 TOPS | 312/624* TFLOPS (bfloat16/FP16 tensor) |
| Tensor Float 32 (TF32) | - | - | 156/312* TFLOPS |
| INT/FP32 | 16 TOPS | 32 ~ 48 TOPS | 19.5 TFLOPS |
| FP64 Tensor Core | - | - | 19.5 TFLOPS |
| FP64 | - | - | 9.7 TFLOPS |

*With sparsity

While comparing different AI platforms by peak performance is not always meaningful, since a great deal depends on software, the numbers give some idea of the capabilities of the 2nd generation Kunlun AI processor.

Baidu started working on its Kunlun AI chip project back in 2011. Initially, the company researched and emulated its many-small-core XPU microarchitecture using FPGAs, but in 2018 it finally built dedicated silicon using one of Samsung Foundry's 14 nm fabrication processes (presumably 14LPP). The chip is equipped with two 8 GB HBM memory packages that offer 512 GB/s of peak bandwidth.

Today, the 1st generation Kunlun is used in Baidu's cloud datacenters, in the company's Apolong autonomous vehicle platform, and in a number of other applications.