This week I attended the fourth edition of ScaledML, a conference dedicated to large-scale Machine Learning. It was a great event with top-tier presenters that showcased where the field is heading. In summary, ML has transitioned out of the research labs and into the fray of mainstream applications. As a result, the industry is coming to terms with how to provide the hardware and software platforms needed by these workloads.
The state of the art in ML is what we call Deep Learning, i.e., learning with Deep Neural Network models. These models provide the best results in computer vision, speech, and text processing and are being used in applications as diverse as self-driving cars, voice assistants, and automatic translation. The exciting part for systems architects is that as Deep Learning models get more sophisticated, they require orders of magnitude more data and compute.
ScaledML could not have been better timed as the ACM has just named the Fathers of Deep Learning – Yoshua Bengio, Geoffrey Hinton, and Yann LeCun – as the latest recipients of the Turing Award, often referred to as the “Nobel Prize of Computing”. The opening keynote was also perfectly suited as it was delivered by one of last year’s Turing Award winners, David Patterson, from UC Berkeley and Google. David presented his well-known view that we’re in a new Golden Age for computer architecture, especially because of the activity around Domain-Specific Architectures (DSAs) for Deep Learning. Under his leadership, Google has been developing the Tensor Processing Units (TPUs) that power ML applications in its cloud.
Of course, the big player in hardware for Deep Learning is NVIDIA, which spent the past couple of decades sharpening its GPUs to accelerate 3D gaming. That was until 2012, when academics led by Turing Laureate Geoffrey Hinton figured they could use those same GPUs to run the matrix multiplications needed to seriously scale neural nets. At ScaledML, Stanford Professor and NVIDIA’s Chief Scientist, Bill Dally, argued that since the era of Moore’s Law and Dennard scaling is over, hardware architects will need to keep coming up with new tricks to train larger models on larger datasets. In the latest generations of NVIDIA GPUs, these tricks include using mixed-precision arithmetic and taking advantage of sparsity in model connections and activations.
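To make the mixed-precision idea concrete, here is a minimal sketch in plain Python. It is not how NVIDIA's hardware is programmed; it only simulates the numerical effect. The assumption is the common scheme in which inputs are stored in 16-bit floats while sums are accumulated in 32-bit floats. The helper names (`to_half`, `to_single`, `dot`) are made up for this illustration, and rounding is emulated with the standard `struct` module.

```python
import struct


def to_half(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round x to the nearest IEEE 754 single-precision (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(xs, ws, round_accumulator):
    """Dot product with FP16 inputs and a configurable accumulator precision.

    round_accumulator is applied to the running sum after every addition,
    which emulates an accumulator register of that precision.
    """
    total = 0.0
    for x, w in zip(xs, ws):
        # Inputs are stored as FP16; their product is formed in Python's
        # native double precision before being added to the running sum.
        product = to_half(x) * to_half(w)
        total = round_accumulator(total + product)
    return total


# Hypothetical workload: a long dot product of small, equal values.
n = 4096
xs = [0.1] * n
ws = [0.1] * n

fp16_sum = dot(xs, ws, to_half)        # FP16 inputs, FP16 accumulator
fp32_sum = dot(xs, ws, to_single)      # FP16 inputs, FP32 accumulator
reference = dot(xs, ws, lambda v: v)   # FP16 inputs, double accumulator

print(f"FP16 accumulator error: {abs(fp16_sum - reference):.6f}")
print(f"FP32 accumulator error: {abs(fp32_sum - reference):.6f}")
```

With a pure FP16 accumulator the running sum stops growing once each new product is smaller than half the gap between neighboring FP16 values, so the result drifts far from the reference. Keeping only the accumulator in FP32 avoids this while the inputs still take half the memory and bandwidth, which is the trade-off that makes mixed precision attractive.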
It’s clear that Deep Learning has sparked a gold rush to figure out who will become the “Intel of the AI era”. One company that is pooling its efforts to make sure it becomes the “Intel of AI” is, not surprisingly, Intel itself. Wei Li, Intel’s VP and General Manager for AI, showed that future generations of Intel CPUs will include tensor operations to accelerate Deep Learning.
One of the several startups working in this space, Graphcore, also came to ScaledML. Co-founder and CTO Simon Knowles talked about their Intelligence Processing Unit (IPU). Graphcore recently announced an OEM agreement with Dell EMC to provide a compute box with 8 IPU PCIe cards (in a similar style to the NVIDIA DGX boxes).
In an attempt to standardize the comparison between all these hardware providers and the myriad of software frameworks that can be used with them, researchers and engineers have come up with MLPerf. MLPerf is a set of benchmarks built on well-known ML tasks and datasets. As of this writing, NVIDIA, Google, and Intel have all submitted results to the MLPerf website, and I expect more providers will report their numbers soon.
There’s obviously a lot of attention being given to accelerator hardware, but ScaledML also had several presentations dedicated to the software frameworks used to create ML models. Facebook talked about how its research-friendly framework, PyTorch, was updated to streamline the deployment of models into production at large scale. NVIDIA is tackling the data pipeline bottleneck with its RAPIDS libraries, which make it easier to run feature engineering and ETL tasks on the GPU itself, so that once data finishes pre-processing, it is already sitting in the right place for training. Databricks presented MLflow, an end-to-end framework for supporting the development and deployment of ML applications, from data processing to model training to deployment into production. Unlike other projects in this space (think TensorFlow Extended or Facebook’s FBLearner), MLflow focuses on integrating components from different stacks and doesn’t require the developer to buy into a specific library for training or serializing models and datasets.
It is not easy to tell if and when Deep Learning’s compute demands will taper off and how many tricks hardware designers can pull out of their hats. Maybe when we are over the hype, we will start seeing commoditized versions of the massive compute boxes like NVIDIA’s DGX. In a world where composability is everywhere, it is only natural to expect a big box of processing power that can be sliced and matched to storage on demand depending on the workload of the day.