Modern Workloads

What are Modern Workloads?

They go by several names: Next Generation, Scale-Out or Cloud Native applications (even though they aren’t exclusive to the cloud). We like to call them “modern workloads”. They are relatively new on the IT landscape, and their very nature has had a profound impact on data center architectures.

Prominent examples of modern workloads come from Big Data, IoT, and social and mobile applications. Unlike traditional enterprise applications such as email or accounting software, modern workloads are more dynamic and unpredictable, and their data sets often grow exponentially. Whereas traditional applications scale modestly and can be managed by increasing the number of virtual machines running on a server, modern workloads scale horizontally, with performance or capacity increases addressed by adding servers. The dynamic and unpredictable nature of these applications means IT has to continuously provision and re-configure resources to keep up with fluctuating demand.

Modern Workload Architectural Principles

The developers of the Big Data architectures at Google and Yahoo were looking to design a platform that could store and process a vast quantity of data at low cost. To achieve this, they developed several key architectural principles to support modern workloads such as Hadoop, Spark, and Cassandra.

Deploy Commodity Servers – Use the lowest-cost storage and compute servers – what were traditionally called “PC Servers”. Every computer manufacturer has several versions of these systems, and the competition has made them low-margin, low-cost platforms. Building clusters of hundreds or thousands of these servers provides vast storage and compute capacity at the lowest possible cost.

Share Nothing, Parallel Architecture – Ensure there is no co-dependency between servers. If anything is shared, even a minor resource, the cluster will eventually reach a point where that shared component becomes a bottleneck and prevents further scaling. Each node needs to be independent, equal, and parallel to all other nodes in the cluster. This allows for linear scaling from tens to tens of thousands of nodes.
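
As a simple illustration of the shared-nothing idea, the sketch below (plain Python, with hypothetical node names) shows how every node can work out which keys it owns purely by hashing, with no shared lookup service or lock in the request path; production systems such as Cassandra use consistent hashing for the same purpose.

    # Minimal sketch: shared-nothing key ownership by hashing.
    # Node names are hypothetical; real systems (e.g. Cassandra) use
    # consistent hashing so nodes can be added with minimal data movement.
    import hashlib

    NODES = ["node-01", "node-02", "node-03", "node-04"]

    def owner(key: str) -> str:
        """Any node can compute this locally -- there is no shared lookup
        service, no lock, and no single component every request must traverse."""
        digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
        return NODES[digest % len(NODES)]

    for k in ["user:42", "sensor:7731", "order:2024-0001"]:
        print(k, "->", owner(k))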

Move the Application Processing to the Data – Keep the data close to the processor. Since the cheapest way to store data is in commodity servers with high-performance connectivity to their local drives, the fastest way to process huge data stores is to use each of these nodes as a parallel computing unit that processes the data stored on its own drives. This “data locality” is handled by a central control node that sends the application out to the nodes, and each node then processes its own data.
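
The scheduling logic behind data locality can be sketched in a few lines. In the hypothetical Python example below, a block-to-replica map (in Hadoop this information comes from the HDFS NameNode) lets a scheduler prefer a node that already holds the data, falling back to a remote read only when no such node is free.

    # Minimal sketch of locality-aware task placement. The block-to-node
    # map and node names are hypothetical stand-ins for the metadata a
    # real framework tracks centrally.
    block_locations = {
        "blk_0001": ["node-01", "node-04", "node-07"],   # three replicas
        "blk_0002": ["node-02", "node-05", "node-08"],
    }

    def place_task(block_id, free_nodes):
        """Prefer a node that already stores the block (node-local);
        otherwise run anywhere and pull the data over the network."""
        local = [n for n in block_locations[block_id] if n in free_nodes]
        return local[0] if local else free_nodes[0]

    print(place_task("blk_0001", ["node-03", "node-04", "node-05"]))  # node-local: node-04
    print(place_task("blk_0002", ["node-03", "node-09"]))             # remote read: node-03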

Communicate Just the Results – Move data infrequently. Both MapReduce and in-memory frameworks such as Spark keep the majority of data transfers local. Only the results from each stage of processing are moved to other nodes for further processing. This keeps bandwidth requirements between nodes, and especially between racks, to a minimum.
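
A toy word count in plain Python, shown below, illustrates the point: each node reduces its own partition down to a handful of partial counts, and only those small partial results ever cross the network to be merged.

    # Each "node" aggregates its local partition; only the small per-node
    # partial results are communicated for the final merge.
    from collections import Counter

    # Hypothetical per-node partitions (in Hadoop/Spark these sit on local drives).
    node_partitions = [
        ["error", "ok", "ok", "error", "ok"],   # node 1's local block
        ["ok", "ok", "timeout", "error"],       # node 2's local block
    ]

    # Map + local combine: runs independently on each node, no data movement.
    partial_results = [Counter(partition) for partition in node_partitions]

    # Only these small Counters travel between nodes/racks for the final reduce.
    final_counts = sum(partial_results, Counter())
    print(final_counts)   # Counter({'ok': 5, 'error': 3, 'timeout': 1})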

Build Resiliency and Efficiency in Software – Move sophisticated resiliency and efficiency functions out of the hardware and “up” into the software layers. Big Data deals with data resiliency by triplicating the data rather than by using RAID or other data recovery schemes. Management of the three copies is handled by the file system itself, rather than by the storage subsystem. This may seem less efficient, but it enables the use of commodity drives rather than sophisticated storage arrays. Commodity drives cost orders of magnitude less than storage arrays, so the total system remains vastly less expensive. Furthermore, in very large data stores the chance of a double disk failure, which many RAID schemes cannot recover from, becomes a near certainty. Triplicated (and in some cases, quadruplicated) data is much more robust against failure than RAID arrays.
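
The placement idea behind software-managed replication can be shown in simplified form. The rack layout below is hypothetical: one copy stays on the writing node for fast local writes, and the remaining copies land on a different rack, so neither a double disk failure nor the loss of an entire rack removes every replica. HDFS’s default placement policy follows the same spirit.

    # Simplified sketch of software-managed triple replication across
    # failure domains. The rack/node layout is hypothetical.
    import random

    RACKS = {
        "rack-A": ["node-01", "node-02", "node-03"],
        "rack-B": ["node-04", "node-05", "node-06"],
        "rack-C": ["node-07", "node-08", "node-09"],
    }

    def place_replicas(writer_node, copies=3):
        """Keep one copy local, put the rest on another rack so a
        whole-rack outage (or any two failed disks) cannot lose the block."""
        home_rack = next(r for r, nodes in RACKS.items() if writer_node in nodes)
        other_rack = random.choice([r for r in RACKS if r != home_rack])
        return [writer_node] + random.sample(RACKS[other_rack], copies - 1)

    print(place_replicas("node-02"))   # e.g. ['node-02', 'node-05', 'node-04']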

Hadoop, Spark, Cassandra and other Big Data modern workloads are all based on these principles. By following them, enterprises can reap the same benefits first achieved by the hyperscale organizations that developed them.

That’s where Software Composable Infrastructure comes in. It supports all of these Big Data principles, and allows IT organizations to realize even greater cost savings with far greater agility to respond to the ever-changing nature of these modern workloads.

Modern Workloads Need a Scalable, Distributed Infrastructure

Modern workloads, such as Hadoop and other big data technologies, typically manage datasets that are too large to be processed on a single computer. These workloads are deployed on scale-out infrastructures which support a distributed processing framework. Scale-out architectures are based on common off-the-shelf (COTS) servers with internal drives (Direct Attached Storage, or DAS), and data centers deploy anywhere from dozens to tens of thousands of them to achieve the scale required.

Scale-out architectures offer the lowest-cost option for these big data applications since the servers are standardized, available from several manufacturers, and have become commodities. Scale-out also supports the “data locality” principle, which is necessary for the scale and performance required by modern workloads.

Software Composable Infrastructure (SCI) is a next generation data center architecture that delivers significant advantages over standard scale-out infrastructures, as we’ll describe below. But what are the issues with these scale-out architectures?

Scale-Out Lacks the Agility of the Cloud

Despite the advantages of a scale-out architecture for big data applications, it lacks a key capability that is one of the drivers behind the growth of cloud computing – the ability to easily and quickly scale up or down the compute and storage resources needed for each workload.

Cloud-like agility is particularly important for modern workloads for several reasons:

  • It’s difficult to ascertain the correct footprint up-front when planning the deployment of a new workload. To avoid further delays from re-configuring systems, IT teams will often over-provision resources for an application, leading to wasteful spending and poor utilization levels.

  • Modern workloads are highly dynamic, and even if you guessed reasonably well up-front, you’ll soon find that the workload requires more or fewer resources than initially deployed. That means you’re either scrambling to upgrade the cluster, or you have valuable resources left idle.

  • Data growth is unpredictable, and in many cases data sets grow exponentially. IT teams are constantly stressed to respond as quickly as they can to meet the needs of the business, but provisioning new servers is not a quick task and can result in missed SLAs to the organization.

  • As the value of big data grows within enterprises, so does the number and size of the workloads being deployed. This leads to multiple application clusters being created, increasing the inflexibility of the infrastructure as resources become “siloed” in separate clusters.

  • Scale-out architectures often contain thousands of disk drives, so failures are not just possible, they are expected. Responding to server or disk failures is a manual effort, leading to downtime and degraded application performance until the repair is completed.
