There are a number of papers and blog posts stating that disk locality for Hadoop is no longer relevant. I disagree in the strongest terms, and present my case here.
I take the view that the arguments for or against disk locality should be considered with BIG data (Hadoop, duh!) where aggregate data sets can be many petabytes, and individual files may be many terabytes. Because the file sizes can be so huge, traditional caching in memory or on SSDs is not going to reduce the importance of disk locality.
But what is locality? How is it measured? First off, unlike DRAM, NVRAM, and SSDs, latency is not the reason that you desire locality for disks. Disks seeks are so slow compared to processors and networks that the latency to disk truly is irrelevant to locality. Bandwidth, however, is a different story. Many current, cheap, 7200 RPM bulk storage drives can transfer at sustained rates of about 2Gbps. Hadoop, with its huge logical blocks sizes of 128MB or more, is designed to amortize the cost of seeking by transferring large amounts at once. Realistically, we’ve seen long term transfer rates of up to 600Mbps, across many disks simultaneously.
Hence, we should define locality as the typical bandwidth available between endpoints, i.e., a disk and the server which is using it. Obviously, the bandwidth depends on the interconnection technology, link speeds, and interfering traffic. So a typical storage-laden server with 12 or 24 SATA or SAS disks connected to an LSI Logic SAS controller (8 lanes @ 6Gbps) has plenty of bandwidth to make sure the disk traffic never backs up.
For network connected disks the picture is murkier. First off, few servers are connected with better than 10Gbps Ethernet; if they are dual-connected then this would give 20Gbps which should be adequate for up to 24 drives. But that assumes an absence of interfering traffic, which gets worse as the distance (in hops) of disks increases from the server. For disks in the same rack, there is non-blocking bandwidth between the server and disks on the switch, and the interfering traffic is low. Going out of the rack, there is typically a 4:1 over-subscription of bandwidth, so a busy system would have 1/4 the bandwidth for the disks (20G goes to 5G, good for 5 or 6 disks). Each further hop that disk traffic takes make have further over-subscription and/or interfering traffic, so any guarantees get pretty weak as you get further away.
(a) network bandwidths are increasing much faster than disk bandwidths and
(b) new network architectures which provide much greater aggregate bandwidth are being rapidly deployed.
Hadoop Systems – Number of Disks
Crucially, 1 does not really distinguish between in-rack and out-of-rack network characteristics, so their results may only be confirming the fact that in-rack bandwidth is plentiful. Point (a) may be true, but what is important is the network bandwidth vs. the aggregate bandwidth of disks. An old rule of thumb for Hadoop systems was to have the number of disks equal to the number of cores in a server. But the speed and number of cores in a server keeps growing (faster than network bandwidth?), so the number of disks and their bandwidth to keep each server fed grows very rapidly.
Point (b) in 1 is about the rapid deployment of new non-blocking network architectures. But if you look at the leading edge backbone network deployments shown in 5, 6, and 7 it is clear that the new non-blocking network applies to the bandwidth between racks and not between servers. For each of these networks there is still substantial over-subscription of traffic leaving the rack, meaning that the disks had better be local to the rack.
Reference 4 shows a performance test with Isilon storage in which disk locality is supposedly shown to be irrelevant, but the test says nothing about competing traffic on the network or Isilon boxes, so it is very unrealistic. And if it were true that disk locality was unimportant with Isilon, why does 8 explain the “virtual rack” feature like this?: “a virtual rack can, for example, route a client’s connection through its optimal network switch, which improves performance by reducing read throughput latency while minimizing traffic through a top-of-rack switch“. And what the heck is “throughput latency”? Throughput or latency?
So in summary, disk locality IS relevant. However, we should view in-rack as local; as long as a disk is in the same rack as a server, it’s OK to use the network to access the disk.
- Disk-Locality in Datacenter Computing Considered Irrelevant Ananthanarayanan, et al, UC Berkeley
- Disk-Locality in Datacenter Computing Considered Irrelevant (and then what?) Ananthanarayanan, et al, UC Berkeley
- The Elephant in the Big Data Room: Data Locality is Irrelevant for Hadoop Tom Phelan, BlueData
- Comparing Hadoop performance on DAS and Isilon and why disk locality is irrelevant Stefan Radtke’s Blog
- Project Altair: The Evolution of LinkedIn’s Data Center Network Shawn Zandi, LinkedIn
- What It Takes to Run Hadoop at Scale Yahoo
- Introducing data center fabric, the next-generation Facebook data center network
- EMC Isilon Best Practices for Hadoop Data Storage