An algorithm could make CPUs a cheap way to train AI

AI is the backbone of technologies such as Alexa and Siri — digital assistants that rely on deep machine learning to do their thing. But for the makers of these products — and others that rely on AI — getting them “trained” is an expensive and often time-consuming process. Now, scientists from Rice University have found a way to train deep neural nets more quickly, and more affordably, through CPUs.

Typically, companies use GPUs as acceleration hardware for deep learning. But this is pricey: top-of-the-line GPU platforms cost around $100,000. Rice researchers have now created a cost-saving alternative, an algorithm called the sub-linear deep learning engine (SLIDE), which can do the same job of training deep neural networks without the specialized acceleration hardware.

The team took a complex workload and fed it to both a top-of-the-line GPU using Google’s TensorFlow software and a “44-core Xeon-class CPU” using SLIDE, and found the CPU could complete the training in just one hour, compared to three and a half hours for the GPU. (There is, to our knowledge, no such thing as a 44-core Xeon-class CPU, so it’s likely that the team is referring to a 22-core, 44-thread CPU.)

SLIDE works by taking a fundamentally different approach to deep learning. Conventional GPU-based training learns from huge amounts of data using networks that often contain millions or billions of neurons, with different neurons recognizing different types of information. But you don’t need to train every neuron on every case. SLIDE picks only the neurons that are relevant to the learning at hand.
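The article does not say how SLIDE finds the relevant neurons. The researchers' published work describes using locality-sensitive hashing for this, so the idea can be illustrated with a toy example. What follows is a minimal sketch, not the team's implementation: it assumes a single SimHash table (random hyperplanes), and every name and size in it (`simhash`, `build_table`, `active_neurons`, 1,000 neurons, 6 hash bits) is made up for illustration.

```python
import random


def make_planes(num_bits, dim, rng):
    """Draw num_bits random hyperplanes, each a vector of length dim."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def simhash(vec, planes):
    """Pack the sign of vec's dot product with each hyperplane into an int."""
    code = 0
    for plane in planes:
        dot = sum(p * v for p, v in zip(plane, vec))
        code = (code << 1) | (1 if dot >= 0 else 0)
    return code


def build_table(weights, planes):
    """Bucket every neuron by the hash of its weight vector."""
    table = {}
    for neuron_id, w in enumerate(weights):
        table.setdefault(simhash(w, planes), []).append(neuron_id)
    return table


def active_neurons(x, table, planes):
    """Return only the neurons whose weights hash to the same bucket as x."""
    return table.get(simhash(x, planes), [])


# Hypothetical sizes, chosen only to keep the example small.
rng = random.Random(0)
dim, num_neurons, num_bits = 16, 1000, 6

weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_neurons)]
planes = make_planes(num_bits, dim, rng)
table = build_table(weights, planes)

x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
selected = active_neurons(x, table, planes)
print(f"training on {len(selected)} of {num_neurons} neurons for this input")
```

Vectors that point in similar directions tend to land in the same bucket, so a lookup returns a small set of neurons likely to respond strongly to the input, and the rest of the layer is skipped for that example. A practical system would probably use several tables and rebuild them as weights change; this sketch leaves that out.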

According to Anshumali Shrivastava, assistant professor at Rice’s Brown School of Engineering, SLIDE also has the advantage of being data parallel. “By data parallel I mean that if I have two data instances that I want to train on, let’s say one is an image of a cat and the other of a bus, they will likely activate different neurons, and SLIDE can update, or train on these two independently,” he said. “This is a much better utilization of parallelism for CPUs.”
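The independence Shrivastava describes can be shown with a small sketch. It rests on assumptions: the "cat" and "bus" examples are given hand-picked, non-overlapping sets of active neurons, and `sparse_update` is a toy rule that nudges weights toward the input, not SLIDE's actual gradient step. No threads are used. The point is only that when two examples touch different neurons, the order in which their updates are applied does not change the result, which is what lets separate CPU cores handle them at the same time.

```python
def sparse_update(weights, active_ids, x, lr=0.1):
    """Toy update: move only the active neurons' weights toward input x."""
    for i in active_ids:
        weights[i] = [w + lr * (xi - w) for w, xi in zip(weights[i], x)]


# Hypothetical examples: (input vector, ids of the neurons it activates).
samples = {
    "cat": ([1.0, 0.0], [0, 1, 2]),
    "bus": ([0.0, 1.0], [3, 4, 5]),
}


def train(order):
    """Apply one sparse update per example, in the given order."""
    weights = [[0.0, 0.0] for _ in range(6)]
    for name in order:
        x, ids = samples[name]
        sparse_update(weights, ids, x)
    return weights


cat_then_bus = train(["cat", "bus"])
bus_then_cat = train(["bus", "cat"])

# The two examples write to different rows, so the order is irrelevant.
print(cat_then_bus == bus_then_cat)
```

If the two active sets overlapped, the shared neurons would receive both updates and the order could matter, so real parallel training has to tolerate or manage such collisions.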

This approach brought its own challenges, however. “The flipside, compared to GPU, is that we require a big memory,” he said. “There is a cache hierarchy in main memory, and if you’re not careful with it you can run into a problem called cache thrashing, where you get a lot of cache misses.” After the team published their initial findings, Intel got in touch to collaborate on the problem. “They told us they could work with us to make it train even faster, and they were right. Our results improved by about 50 percent with their help.”

SLIDE is a promising development for those involved in AI. It’s unlikely to replace GPU-based training any time soon, because it’s far easier to add multiple GPUs to one system than multiple CPUs. (The aforementioned $100,000 GPU system, for example, has eight V100s.) What SLIDE does have, though, is the potential to make AI training more accessible and more efficient.

Shrivastava says there’s much more to explore. “We’ve just scratched the surface,” he said. “There’s a lot we can still do to optimize. We have not used vectorization, for example, or built-in accelerators in the CPU, like Intel Deep Learning Boost. There are a lot of other tricks we could still use to make this even faster.” However, the key takeaway, Shrivastava says, is that SLIDE shows there are other ways to implement deep learning. “Ours may be the first algorithmic approach to beat GPU, but I hope it’s not the last.”