Foundation models have the potential to change the way organizations build and train artificial intelligence (AI) with machine learning (ML).
A major challenge in building foundation models is that, until now, they have generally required specific types of networking and infrastructure hardware to train efficiently. There is also limited support for developers who want to build a foundation model with a fully open source stack. IBM Research is working to solve this challenge in several ways.
“Our question was: can we train foundation models, but in such a way that we do it on basic hardware? And make it more accessible instead of just being in the hands of a select few researchers,” Raghu Ganti, principal research associate at IBM, told VentureBeat.
To that end, IBM announced today that it has developed and contributed code to the open-source PyTorch machine learning project to make the technology work more efficiently with standard Ethernet-based networks. IBM has also built an open source operator that helps optimize PyTorch deployment on the Red Hat OpenShift platform, which is based on the open source Kubernetes cloud container orchestration project.
To infinity and beyond: how IBM helped expand PyTorch
To date, many foundation models have been trained on hardware that supports the InfiniBand networking stack, typically found only in high-performance computing (HPC) environments.
While GPUs are the foundation of AI, there is a need for powerful networking technology to connect multiple GPUs together. Ganti explained that it is possible to train large models without InfiniBand networks, but it is inefficient in a number of ways.
For example, he said that with standard PyTorch, training a model with 11 billion parameters over an Ethernet-based network may achieve only 20% GPU utilization. Improving that efficiency is what IBM did alongside the PyTorch community.
“This is a very complex problem and there are a lot of knobs to tune,” said Ganti.
The knobs that need tweaking are all about ensuring optimized GPU and network usage. Ganti said the goal is to keep both the network and GPU busy at the same time to speed up the overall training process.
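Conceptually, keeping the network and the GPU busy at the same time means overlapping gradient and parameter transfers with computation rather than running them back-to-back. A minimal sketch of the arithmetic behind that idea, using hypothetical per-step timings (not IBM's actual measurements):

```python
# Toy model of one distributed training step.
# All timings are hypothetical, for illustration only.

compute_ms = 80.0   # time the GPU spends on forward/backward math
comm_ms = 60.0      # time spent moving gradients/parameters over Ethernet

# Naive schedule: the GPU idles while network transfers run.
serial_step = compute_ms + comm_ms           # 140.0 ms per step

# Overlapped schedule: transfers are issued asynchronously, so the
# network works while the GPU computes; the step is bounded by the
# slower of the two activities.
overlapped_step = max(compute_ms, comm_ms)   # 80.0 ms per step

gpu_utilization_serial = compute_ms / serial_step       # ~0.57
gpu_utilization_overlap = compute_ms / overlapped_step  # 1.0 here

print(serial_step, overlapped_step)
print(round(gpu_utilization_serial, 2), gpu_utilization_overlap)
```

The "knobs" Ganti describes govern how well this overlap is achieved in practice: how transfers are chunked, scheduled, and interleaved with computation so that neither the GPU nor the network sits idle.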
The code to optimize PyTorch to work better over Ethernet was merged into the PyTorch 1.13 update that became generally available on October 28.
“We were able to go from 20% GPU usage all the way to 90%, and that’s a 4.5x improvement in terms of training speeds,” said Ganti.
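The 4.5x figure follows directly from the utilization numbers: with a fixed amount of GPU work, training time scales inversely with GPU utilization, so the speedup is the ratio of the two utilizations.

```python
# For a fixed amount of GPU work, training time is inversely
# proportional to GPU utilization, so speedup = new / old utilization.
baseline_pct = 20   # standard PyTorch over Ethernet, per Ganti
improved_pct = 90   # after the PyTorch 1.13 improvements

speedup = improved_pct / baseline_pct
print(speedup)  # 4.5
```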
Shifting PyTorch into high gear for faster training
In addition to the code improvements in PyTorch, IBM has also been working on the open-source Red Hat OpenShift Kubernetes platform to support foundation model development.
Ganti said part of what they’ve been doing is making sure that the maximum bandwidth the Ethernet network can provide is reflected at the pod level in OpenShift.
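IBM's operator configuration is not detailed in the article, but as an illustration of what "reflecting bandwidth at the pod level" can look like, Kubernetes' CNI bandwidth plugin exposes per-pod network limits through standard annotations. A hedged sketch of such a pod manifest, built as a Python dict (all names, image tags, and values are hypothetical, not IBM's actual operator output):

```python
# Sketch: a pod manifest declaring per-pod network bandwidth via the
# standard Kubernetes CNI bandwidth-plugin annotations, with GPUs
# requested through the NVIDIA device plugin's resource name.
# Values are illustrative only.
pod_manifest = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "pytorch-worker-0",  # hypothetical pod name
        "annotations": {
            # Cap ingress/egress at the NIC's line rate so the pod
            # can actually use the full Ethernet bandwidth available.
            "kubernetes.io/ingress-bandwidth": "100G",
            "kubernetes.io/egress-bandwidth": "100G",
        },
    },
    "spec": {
        "containers": [
            {
                "name": "trainer",
                "image": "pytorch/pytorch:1.13",  # illustrative tag
                "resources": {"limits": {"nvidia.com/gpu": 8}},
            }
        ]
    },
}
```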
Using Kubernetes to train foundation models is not a new idea. OpenAI, the organization behind some of the most widely used models, including GPT-3 and DALL-E, has publicly discussed how it uses Kubernetes. What is new, according to IBM, is that the technology for this is available as open source. IBM has open sourced a Kubernetes operator that provides the necessary configuration to help organizations scale a cluster to support large model training.
With the PyTorch Foundation, more open-source innovation is now possible
Until September, PyTorch was operated as an open-source project managed by Meta. That changed on September 12, when the PyTorch Foundation was announced as a new organizing body led by the Linux Foundation.
Ganti said IBM’s effort to contribute code to PyTorch actually started before the announcement of the new PyTorch Foundation. He explained that under Meta’s administration, IBM could not actually directly commit code to the project. Instead, the code had to be committed by Meta staffers who had commit access.
Ganti expects PyTorch to become more collaborative and open under the leadership of the Linux Foundation. “I think [the PyTorch Foundation] will improve open-source collaboration,” said Ganti.