Hardware dependence -- what it is and why it's a problem [Q&A]
We're currently in the middle of a global chip shortage, while at the same time major hardware companies like Intel, NVIDIA and Arm are looking to dominate the hardware market for AI and ML applications.
This creates a problem: models have to be tuned and optimized for specific hardware specifications and software frameworks, sacrificing the portability that the industry has come to take for granted.
We spoke to Jason Knight, CPO at OctoML, to find out why hardware dependence is becoming such an issue.
BN: Can you describe what you mean when you say 'hardware dependence' in relation to deploying ML models and applications?
JK: Generally, when ML models are created and trained, it's done by a data scientist using software and hardware that is well suited to handling large amounts of compute and data. This creates implicit and explicit dependencies between the hardware where a model was trained and the hardware where it can be deployed.
Let's take this example: Python is the lingua-franca of ML training today, but a developer might want to deploy that model onto a device or codebase where Python is not a good choice. Because interpreted languages like Python are rarely used in embedded and mobile contexts, the developer may not be able to deploy the model at all without expending significant porting effort.
More subtle hardware dependence shows up when a model is 'fast enough' when run on the hardware it was trained on, but is either too slow -- or doesn't fit at all -- on the device that you'd like to deploy your model and application to. The performance problems can come from raw theoretical performance not being high enough -- for example where the particular quantization machinery is not present -- or, more subtly, from software: the implementations of operators that you relied upon (and that worked well) when building the model give lower performance on your target device because of software limitations, as opposed to hardware limitations.
Further dependence can also be introduced by the software stacks surrounding the model execution. For example, you may be familiar with the installation and support of one GPU vendor's device drivers and the limitations of its libraries, only to run into issues when switching to another vendor's stack, where you are now missing the performance counters, debug capabilities, or thermal throttling characteristics that you were used to on the other hardware platform.
All of these factors combine to create a very high barrier for a developer, team, or company to switch from one platform in training to another for deployment (or migrating from one deployment platform to another).
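As a rough illustration of the 'doesn't fit at all' case described above, the following minimal sketch estimates whether a model's weights fit in a target device's memory under different numeric formats. Everything in it is an assumption made for illustration: the `Device` class, the `formats_that_fit` helper, the device names, and all the numbers are hypothetical, and the estimate counts weights only, ignoring activations and runtime overhead.

```python
from dataclasses import dataclass

# Bytes needed to store one weight in each numeric format.
BYTES_PER_WEIGHT = {"float32": 4.0, "float16": 2.0, "int8": 1.0}


@dataclass
class Device:
    """Hypothetical description of a deployment target."""
    name: str
    memory_bytes: int
    supported_formats: tuple


def weight_bytes(num_params: int, fmt: str) -> float:
    """Approximate size of the weights alone (no activations, no runtime overhead)."""
    return num_params * BYTES_PER_WEIGHT[fmt]


def formats_that_fit(num_params: int, device: Device) -> list:
    """Return the formats the device supports whose weights fit in its memory."""
    return [
        fmt
        for fmt in device.supported_formats
        if weight_bytes(num_params, fmt) <= device.memory_bytes
    ]


# Illustrative numbers only: a 100M-parameter model and two made-up devices.
training_gpu = Device("training-gpu", 16 * 1024**3, ("float32", "float16", "int8"))
edge_board = Device("edge-board", 256 * 1024**2, ("float32", "int8"))

for device in (training_gpu, edge_board):
    print(device.name, formats_that_fit(100_000_000, device))
```

With these made-up figures, the float32 weights (about 400 MB) exceed the smaller device's 256 MiB, so only a quantized format would fit there, and only if the device supports that format in the first place.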
BN: What are the repercussions of this dependence?
JK: The 30,000-foot view is that this lack of agility is slowing down innovation. We can all firmly agree that ML is no longer a 'visionary' concept. Businesses are deriving practical business value, and increasing investments each year. ML is helping advance the sciences and healthcare. However, we're hitting a point of diminishing returns. ROI could be greater; the speed at which ML helps uncover breakthroughs in the aforementioned areas could be faster.
But the resources/costs/time it takes to deploy ML are driving businesses to focus more on budgets than innovation -- especially now given the economy. And this doesn't come as a surprise given 90 percent of ML compute costs are tied to inferencing production workloads.
This leaves many in the industry scratching their heads wondering: why don't we have a cost-effective option to shift around models at peak performance like we do with standard software development in the cloud? This is the major problem we face in the ML industry today.
BN: There is software that comes 'built into' certain pieces of hardware? Is that not enough?
JK: Even in a perfect world, where all ML software that hardware vendors provide gives optimal performance for every ML workload and has every feature that developers desire, there are still hardware limitations such as memory size, memory bandwidth, compute speed, quantization format support, etc. that create barriers to moving ML models from one hardware platform to another. There are also the switching costs described above between training and deployment software ecosystems.
And let’s be clear: ML software at all levels of the stack is far from perfect. The field moves rapidly and the search to converge on shared/optimal software patterns is still a work in progress. We are still a long way off from the ML equivalent of POSIX, x86, OSI networking stack, HTTP, or other near universal API convergences that have happened in computing’s history. And as a result, there are still many rough spots in the implementations that you find as hardware vendors struggle to keep up with the pace of innovation.
BN: How can ML practitioners break this dependence?
JK: The shortcut path that most ML practitioners use today (wherever possible) is to work around the problem by trying to maintain a training environment (both software and hardware) that is as similar as possible to their deployment environment. However, this is often costly (oversized/specialized hardware at deployment time) and many practitioners do not have this choice. This 'workaround' is one of the reasons why NVIDIA has been so strong in the ML market at large: their high market penetration in ML training puts the center of 'ML ecosystem mass' strongly in their favor.
To break the dependence, we need better abstractions between ML practitioners and the hardware that they depend on. This will take time (as all industry consolidations do) and doubly so because the pace of innovation in ML is still so rapid. I expect we’ll see the most concentrated attempts at breaking the dependence come from cloud providers (who would love for their users to migrate compute-heavy ML workloads onto their increasingly proprietary compute platforms) and motivated startups that are able to apply innovative solutions to creating these new abstractions. One example of this is OctoML, which uses machine learning to develop portable ML software experiences for users to take a model and optimize it for a range of hardware.
BN: Do you think this is a problem that can be surpassed now or will it take several years?
JK: We can make significant inroads with focused efforts by abstracting away some of this complexity for users. But for the broader industry to coalesce on one or two APIs/frameworks/open standards will take five plus years, if ever.
To see how this could play out, look at the distributed computing industry where we have Kubernetes as the de facto standard in place today, yet there is still a significant amount of innovation happening in the Function-as-a-Service domain to eclipse Kubernetes as the API to deploy web workloads onto. Or go even further back to the database industry, which has recently seen the explosion of NoSQL and now hosted data lake platforms like Snowflake.
BN: Ultimately, what will hardware independence enable?
JK: Hardware independence will enable more rapid innovation in intelligent applications. Today the ability to roll out sophisticated AI applications like ChatGPT, Stable Diffusion, or AlphaZero is available to only the most talented of technologists with deep pockets for serving large models on expensive hardware. True hardware independence will enable better abstractions for application developers that lower the complexity and cost of deploying ML. This will especially unlock innovation for ML on the edge, where the variety of ML acceleration hardware is only now emerging but is still locked behind (often) proprietary APIs and implementations that slow the pace of innovation.
Image credit: Alexmit/depositphotos.com