Many companies have high hopes for AI to revolutionize their business, but those hopes can be quickly crushed by the staggering costs of training sophisticated AI systems. Elon Musk has pointed out that engineering problems are often the reason why progress stagnates. This is particularly evident when optimizing hardware such as GPUs to efficiently handle the massive computational requirements of training and fine-tuning large language models.
While big tech giants can afford to spend millions and sometimes billions on training and optimization, small to medium-sized businesses and startups with shorter runways often find themselves sidelined. In this article, we’ll explore a few strategies that may allow even the most resource-constrained developers to train AI models without breaking the bank.
In for a dime, in for a dollar
As you may know, creating and launching an AI product, whether it’s a foundation model/large language model (LLM) or a fine-tuned downstream application, relies heavily on specialized AI chips, specifically GPUs. These GPUs are so expensive and hard to obtain that SemiAnalysis coined the terms “GPU-rich” and “GPU-poor” within the machine learning (ML) community. The training of LLMs can be costly mainly because of the expenses associated with the hardware, including both acquisition and maintenance, rather than the ML algorithms or expert knowledge.
Training these models requires extensive computation on powerful clusters, with larger models taking even longer. For example, training LLaMA 2 70B involved exposing 70 billion parameters to 2 trillion tokens, necessitating on the order of 10^24 floating-point operations. Should you give up if you are GPU-poor? No.
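To see where a number of that magnitude comes from, here is a rough back-of-the-envelope sketch. It assumes the common rule of thumb that training costs about 6 floating-point operations per parameter per token, which is an approximation and not something taken from the LLaMA 2 authors. For 70 billion parameters and 2 trillion tokens this gives about 8.4 × 10^23 operations, the same order of magnitude as the figure above. The peak throughput and utilization values in the code are illustrative assumptions, not measurements.

```python
def training_flops(num_parameters: int, num_tokens: int) -> int:
    """Approximate training compute with the 6 * N * D rule of thumb (an assumption)."""
    return 6 * num_parameters * num_tokens


def gpu_hours(total_flops: float, peak_flops_per_second: float, utilization: float) -> float:
    """Convert a FLOP budget into GPU-hours for a given peak rate and utilization."""
    effective_rate = peak_flops_per_second * utilization
    return total_flops / effective_rate / 3600


# 70 billion parameters, 2 trillion tokens (figures from the text above).
llama2_70b_flops = training_flops(70 * 10**9, 2 * 10**12)

# Assumed hardware numbers, for illustration only:
# 312 TFLOP/s peak per GPU and 40% sustained utilization.
ASSUMED_PEAK_FLOPS = 312e12
ASSUMED_UTILIZATION = 0.4

estimated_hours = gpu_hours(llama2_70b_flops, ASSUMED_PEAK_FLOPS, ASSUMED_UTILIZATION)

print(f"Estimated training compute: {llama2_70b_flops:.2e} FLOPs")
print(f"Estimated GPU-hours under the assumptions above: {estimated_hours:,.0f}")
```

Changing the assumed utilization shows how directly wasted GPU time turns into cost. That is the lever the software techniques below try to pull.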
Alternative strategies
Today, several strategies exist that tech companies are utilizing to find alternative solutions, reduce dependency on costly hardware, and ultimately save money.
One approach involves tweaking and streamlining training hardware. Although this route is still largely experimental as well as investment-intensive, it holds promise for future optimization of LLM training. Examples of such hardware-related solutions include custom AI chips from Microsoft and Meta, new semiconductor initiatives from Nvidia and OpenAI, single compute clusters from Baidu, rental GPUs from Vast, and Sohu chips by Etched, among others.
While it’s an important step for progress, this methodology is still more suitable for big players who can afford to invest heavily now to reduce expenses later. It doesn’t work for newcomers with limited financial resources wishing to create AI products today.
What to do: Innovative software
With a low budget in mind, there’s another way to optimize LLM training and reduce costs — through innovative software. This approach is more affordable and accessible to most ML engineers, whether they are seasoned pros or aspiring AI enthusiasts and software developers looking to break into the field. Let’s examine some of these code-based optimization tools in more detail.
Mixed precision training
What it is: Imagine your company has 20 employees, but you rent office space for 200. Obviously, that would be a clear waste of your resources. A similar inefficiency actually happens during model training, where ML frameworks often allocate more memory than is really necessary, for example by storing every value at full 32-bit precision even where 16 bits would do. Mixed precision training corrects that by using lower precision where it is safe to do so, improving both speed and memory usage.
How it works: To achieve that, lower-precision bfloat16 or float16 operations are combined with standard float32 operations, so that many values take half the memory and less data has to be moved and processed at any one time. This may sound like a bunch of technical mumbo-jumbo to a non-engineer, but what it means essentially is that an AI model can process data faster and require less memory, typically without compromising accuracy.
Improvement metrics: This technique can lead to runtime improvements of up to 6 times on GPUs and 2-3 times on TPUs (Google’s Tensor Processing Unit). Open-source frameworks like Nvidia’s APEX and Meta AI’s PyTorch support mixed precision training, making it easily accessible for pipeline integration. By implementing this method, businesses can substantially reduce GPU costs while still maintaining an acceptable level of model performance.
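The numerical idea behind mixing precisions can be shown with nothing but Python's standard library. The sketch below is not a training loop and does not use APEX or PyTorch. It only shows why the cheap 16-bit format is combined with a 32-bit copy of the weights instead of replacing it. The helper names and the update size of 1e-4 are hypothetical choices for illustration, and loss scaling, which real mixed precision setups often add, is left out.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_float32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Storage cost per value: half precision needs half the bytes.
BYTES_FP16 = struct.calcsize("<e")
BYTES_FP32 = struct.calcsize("<f")

update = 1e-4  # hypothetical small weight update per step

fp16_weight = to_float16(1.0)    # weight kept only in float16
master_weight = to_float32(1.0)  # float32 "master" copy of the same weight

for _ in range(100):
    # The update itself is computed in cheap float16 arithmetic.
    step = to_float16(update)
    # float16 only: near 1.0 the representable values are about 0.001 apart,
    # so adding 0.0001 rounds straight back to 1.0 and the update is lost.
    fp16_weight = to_float16(fp16_weight + step)
    # Mixed precision: the float16 update is accumulated into the float32 copy.
    master_weight = to_float32(master_weight + step)

print(f"bytes per value: float16={BYTES_FP16}, float32={BYTES_FP32}")
print(f"float16-only weight after 100 steps: {fp16_weight}")
print(f"float32 master weight after 100 steps: {master_weight}")
```

The float16-only weight never moves, while the float32 master copy accumulates all 100 updates. This is why the technique keeps both formats instead of simply switching everything to 16 bits.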
Activation checkpointing
What it is: If you’re constrained by limited memory but at the same time willing to put in more time, checkpointing might be the right technique for you. In a nutshell, it helps to reduce memory consumption significantly by keeping the values stored during training to a bare minimum, thereby enabling LLM training without upgrading your hardware.
How it works: The main idea of activation checkpointing is to store a subset of essential values during model training and recompute the rest only when necessary. This means that instead of keeping all intermediate data in memory, the system only keeps what’s vital, freeing up memory space in the process. It’s akin to the “we’ll cross that bridge when we come to it” principle, which implies not fussing over less urgent matters until they require attention.
Improvement metrics: In most situations, activation checkpointing reduces memory usage by up to 70%, although it also extends the training phase by roughly 15-25%. This fair trade-off means that businesses can train large AI models on their existing hardware without pouring additional funds into the infrastructure. The aforementioned PyTorch library supports checkpointing, making it easier to implement.
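The store-a-subset-and-recompute idea can be illustrated with a toy chain of scalar layers in plain Python. This is a minimal sketch under stated assumptions, not PyTorch's implementation: the tanh layer, the function names, and the segment size of 4 are all hypothetical. It compares a backward pass that stores every layer input with one that stores only one input per segment and recomputes the rest.

```python
import math


def layer(w: float, x: float) -> float:
    """A toy layer: y = tanh(w * x)."""
    return math.tanh(w * x)


def layer_grad(w: float, x: float) -> float:
    """Derivative of the toy layer with respect to its input x."""
    y = math.tanh(w * x)
    return w * (1.0 - y * y)


def backward_full(weights, x0):
    """Store every layer input, then backpropagate. Returns (gradient, values stored)."""
    inputs, x = [], x0
    for w in weights:
        inputs.append(x)
        x = layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad *= layer_grad(w, x_in)
    return grad, len(inputs)


def backward_checkpointed(weights, x0, segment):
    """Store one input per segment and recompute the rest during the backward pass.

    Returns (gradient, peak values stored, layer evaluations recomputed).
    """
    checkpoints, x = {}, x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # keep only the input of each segment
        x = layer(w, x)

    grad, recomputed, peak = 1.0, 0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        seg_weights = weights[start:start + segment]
        # Recompute this segment's inputs from its checkpoint.
        inputs, x = [], checkpoints[start]
        for w in seg_weights:
            inputs.append(x)
            x = layer(w, x)
            recomputed += 1
        # Upper bound on live values: all checkpoints plus one segment.
        peak = max(peak, len(checkpoints) + len(inputs))
        for w, x_in in zip(reversed(seg_weights), reversed(inputs)):
            grad *= layer_grad(w, x_in)
    return grad, peak, recomputed


weights = [0.5 + 0.05 * i for i in range(16)]  # 16 toy layers
full_grad, full_stored = backward_full(weights, 0.3)
ckpt_grad, ckpt_peak, ckpt_recomputed = backward_checkpointed(weights, 0.3, segment=4)

print(f"stored values: full={full_stored}, checkpointed peak={ckpt_peak}")
print(f"extra layer evaluations when checkpointing: {ckpt_recomputed}")
print(f"gradients match: {math.isclose(full_grad, ckpt_grad)}")
```

Both versions produce the same gradient. The checkpointed one holds fewer values at its peak and pays for that with roughly one extra forward pass, which is the memory-for-time trade described above.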
Multi-GPU training
What it is: Imagine that a small bakery needs to produce a large batch of baguettes quickly. If one baker works alone, it’ll probably take a long time. With two bakers, the process speeds up. Add a third baker, and it goes even faster. Multi-GPU training operates in much the same way.
How it works: Rather than using one GPU, you utilize several GPUs simultaneously. AI model training is therefore distributed among these GPUs, allowing them to work alongside each other. Logic-wise, this is kind of the opposite of the previous method, checkpointing, which reduces hardware acquisition costs in exchange for extended runtime. Here, we utilize more hardware but squeeze the most out of it and maximize efficiency, thereby shortening runtime and reducing operational costs instead.
Improvement metrics: Here are three robust tools for training LLMs with a multi-GPU setup, listed in increasing order of efficiency based on experimental results:
- DeepSpeed: A library designed specifically for training AI models with multiple GPUs, which is capable of achieving speeds of up to 10X faster than traditional training approaches.
- FSDP: One of the most popular frameworks in PyTorch that addresses some of DeepSpeed’s inherent limitations, raising compute efficiency by a further 15-20%.
- YaFSDP: A recently released enhanced version of FSDP for model training, providing 10-25% speedups over the original FSDP methodology.
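All three tools build on the same basic pattern: each GPU processes its own slice of the batch, and the results are combined before the model is updated. The sketch below simulates that data-parallel step in a single Python process for a one-parameter model. It does not use DeepSpeed, FSDP or YaFSDP, it leaves out the parameter sharding those libraries add on top, and every function name in it is hypothetical.

```python
import math


def local_gradient(w: float, shard) -> float:
    """Mean-squared-error gradient for the model y = w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values) -> float:
    """Stand-in for the all-reduce step: average the workers' gradients."""
    return sum(values) / len(values)


def data_parallel_step(w: float, batch, num_workers: int, lr: float) -> float:
    """One training step with the batch split evenly across simulated workers."""
    if len(batch) % num_workers != 0:
        raise ValueError("batch size must be divisible by the number of workers")
    shard_size = len(batch) // num_workers
    shards = [batch[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    # On real hardware these gradients would be computed at the same time,
    # one per GPU; here they are computed one after another.
    gradients = [local_gradient(w, shard) for shard in shards]
    return w - lr * all_reduce_mean(gradients)


# Toy data generated from y = 3 * x.
batch = [(float(x), 3.0 * x) for x in range(1, 9)]

single_worker = data_parallel_step(0.0, batch, num_workers=1, lr=0.01)
four_workers = data_parallel_step(0.0, batch, num_workers=4, lr=0.01)

print(f"updated weight with 1 worker:  {single_worker}")
print(f"updated weight with 4 workers: {four_workers}")
print(f"same update: {math.isclose(single_worker, four_workers)}")
```

With equal-sized shards, averaging the per-worker gradients gives the same update as processing the whole batch on one device. This is why the work can be spread across several GPUs without changing what the model learns.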
Conclusion
By using techniques like mixed precision training, activation checkpointing, and multi-GPU usage, even small and medium-sized enterprises can make significant progress in AI training, both in model fine-tuning and creation. These tools enhance computational efficiency, reduce runtime and lower overall costs. Additionally, they allow for the training of larger models on existing hardware, reducing the need for expensive upgrades. By democratizing access to advanced AI capabilities, these approaches enable a wider range of tech companies to innovate and compete in this rapidly evolving field.
As the saying goes, “AI won’t replace you, but someone using AI will.” It’s time to embrace AI, and with the strategies above, it’s possible to do so even on a low budget.
Ksenia Se is founder of Turing Post.