Last year, I wrote a blog article on the continuing importance of data locality. But I continue to hear storage and big data vendors claim it is now irrelevant. Meanwhile, the problems of data gravity are widely recognized – some of the same people who say data locality no longer matters insist that data gravity does. But they are just two sides of the same coin!
Data gravity is the apparent attraction of data to other data. But data is totally passive; it takes processors to do anything with data. And when the processing requires multiple sources of data, those data must all be brought to the same place for processing. It is this communication that represents the greatest cost and the greatest bottleneck in modern computing. Obviously, when there is data locality, there is much less of a communication problem.
Ethernet networking technology has been seeing incredible price/performance improvements in the past few years. In 2017, 25Gbps and 100Gbps to the server will be quite common – 100Gbps throughout the backbone will be the norm. Yet even with these incredible improvements, the network remains the bottleneck for big data processing, especially for remote storage.
Years ago, the Hadoop community came up with a rule of thumb that is still quite popular – allocate 1 CPU core for every disk drive. This is not bad for a rule of thumb, but it is hardly science. It still works because the speed of individual CPU cores and the read/write bandwidth of hard drives are both improving only slowly. But both CPUs and hard drives are changing rapidly in other ways. Hard drives have grown phenomenally in capacity (12TB today!), and the number of CPU cores per CPU chip still tracks Moore's law. In 2017, Intel is expected to release the Xeon Skylake-DP processors with up to 32 cores (64 threads) per chip! That means a typical two-socket server with up to 64 cores. Given that a typical hard drive can stream data at about 1Gbps, you'll need up to 64Gbps of storage bandwidth to keep such a server busy!
Multiply that 64Gbps by the number of servers in your Hadoop cluster and you'll get a very big number that no centralized storage system is capable of delivering. Scale-out storage backends might be able to scale that high, but the cost of the network bandwidth between the processing cluster and the storage cluster will kill you. Only through the use of local storage can you scale the storage bandwidth to keep up with the processing power.
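The arithmetic above can be sketched in a few lines. This is only a back-of-the-envelope model using the article's own figures (1 core per drive, ~1Gbps sustained per hard drive, 64 cores per server); the cluster sizes in the loop are illustrative assumptions, not benchmarks.

```python
# Back-of-the-envelope storage-bandwidth math from the article.
# All figures are assumptions stated in the text, not measurements.

DRIVE_STREAM_GBPS = 1     # ~1 Gbps sustained per hard drive
CORES_PER_SERVER = 64     # two-socket server, 32 cores per chip

# Hadoop rule of thumb: 1 core per drive, so drives == cores.
drives_per_server = CORES_PER_SERVER
server_bandwidth_gbps = drives_per_server * DRIVE_STREAM_GBPS
print(f"Per-server storage bandwidth: {server_bandwidth_gbps} Gbps")

# Scale out and the aggregate quickly dwarfs what any centralized
# storage backend's front-end network can deliver.
for servers in (10, 100, 1000):
    total_gbps = servers * server_bandwidth_gbps
    print(f"{servers:>5} servers -> {total_gbps:>6} Gbps "
          f"({total_gbps / 1000:.1f} Tbps) aggregate")
```

Even at a modest 100 servers, the cluster wants more than 6Tbps of aggregate storage bandwidth – which is exactly why keeping the drives local to the compute is the only way the numbers work out.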
Big Data analytics is unique in its need to consume huge amounts of storage bandwidth in parallel. This is because of the unstructured nature of the data, for which programs constantly need to count, search, index, or sort the data. Traditional databases impose structure and create indices by forcing the data to conform to some defined schema. But with big data, analysts are constantly searching for the schema that makes sense for their specific queries.
The DriveScale system offers the operational and economic advantages that come with the separation of compute and storage, while preserving the benefits of data locality and the ability to scale big data clusters to thousands of servers. With DriveScale, data gravity won't get you down!