I had the privilege of attending the Dell EMC HPC Community meeting prior to the SuperComputing ’17 conference. There was a lot of discussion of Deep Learning as a rapidly growing workload of concern to almost every enterprise. The meeting was topped off by the announcement of the Dell EMC PowerEdge C4140, a server designed specifically for very dense GPU-accelerated computing.
Deep learning requires incredible amounts of compute horsepower, even for the “inference” stage, which is what applications actually use. During the design, refinement, and training of neural network models, however, huge amounts of data are needed in addition to enormous amounts of compute. Training a model on fewer than a million samples typically does not yield good accuracy, and finding a good model may take thousands of training iterations. The result is an enormous amount of data transfer within large clusters of GPU-accelerated servers.
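To put rough numbers on that claim, here is a back-of-the-envelope sketch. Every figure in it (sample size, epoch count, number of model-search runs) is an illustrative assumption, not a measurement from any particular workload:

```python
# Back-of-the-envelope estimate of data movement during model development.
# All numbers below are illustrative assumptions, not measurements.

samples = 1_000_000          # assumed training-set size (the "million samples" threshold)
bytes_per_sample = 100_000   # assumed ~100 KB per sample (e.g. a modest image)
epochs = 50                  # assumed passes over the dataset per training run
runs = 1_000                 # assumed model-search iterations (architectures, hyperparameters)

dataset_bytes = samples * bytes_per_sample
total_bytes = dataset_bytes * epochs * runs

print(f"dataset size: {dataset_bytes / 1e12:.1f} TB")
print(f"total read:   {total_bytes / 1e15:.1f} PB")
```

Even with a dataset of only 0.1 TB, repeated passes across thousands of development iterations push total reads into the petabyte range, which is why aggregate storage bandwidth, not just raw capacity, dominates the cluster design.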
This data workload is actually quite similar to large-scale data analytics, so most large deep learning developers use HDFS – the Hadoop Distributed File System – which can deliver the aggregate performance needed. HDFS runs on servers with Direct-Attached Storage (DAS), which offers far better cost and far more scalable performance than any traditional SAN or NAS storage solution.
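That DAS layout shows up directly in HDFS configuration: each DataNode lists its local disks as data directories, and blocks are replicated across nodes for durability and read parallelism. A minimal illustrative excerpt (the paths are hypothetical; the property names are standard HDFS settings):

```xml
<!-- hdfs-site.xml (excerpt): illustrative DataNode settings -->
<property>
  <name>dfs.datanode.data.dir</name>
  <!-- one entry per direct-attached disk; paths are examples -->
  <value>/data/disk1/dfs,/data/disk2/dfs,/data/disk3/dfs</value>
</property>
<property>
  <name>dfs.replication</name>
  <!-- default 3-way replication spreads reads across servers -->
  <value>3</value>
</property>
```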
But now cluster architects face a conundrum: should they buy servers built for dense GPU computing, or servers with lots of DAS storage to support HDFS? Getting both in one package is really not possible anymore. So many users are stuck buying *both* types of servers, even though the CPUs in those servers are little more than “babysitters” for the GPUs or the storage. And separating the compute nodes from the storage nodes introduces network inefficiencies and scaling problems.
Fortunately, DriveScale’s Software Composable Infrastructure solves the conundrum. Using the C4140 or similar servers to host dense GPUs, logical servers can be composed with commodity JBOD-based storage – yielding servers that are good at handling both GPU computing and HDFS. This reduces server count, reduces the number of server types that must be maintained, and makes much better use of the CPUs in those servers.
As network fabrics mature, we’ll start to see the GPUs themselves become disaggregated from the servers, making both the servers and the GPUs denser and more efficient. DriveScale is already working with hardware partners to include GPUs in future composability solutions.
Deep Learning needs DriveScale!