Durability is one of the fundamental properties that people expect from databases and file systems. From Wikipedia: “The durability property ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors.” Achieving durability has never been easy, and with the proliferation of new types of databases, distributed processing, virtual environments, and firmware-heavy storage devices, it is harder than ever.
Durability is Hard.
Interestingly, the definition of durability seems to also change with time. In the good old days of 20 to 30 years ago, durability mostly meant that your transaction was safe on disk in the event that “the” server crashed or lost power. Back then, one would make sure the server got fixed (could be a few days), and then you could get your data back. Now, of course, no one is going to tolerate waiting a few days for their data to be back, so replication to other servers or shared storage media has become a standard expectation. Therefore:
Durability demands Replication
Replication is great at handling a lot of potential problems. The combination of replication with application-level tolerance to resources coming and going is the basis of the new Scale-Out world. Replication is awesome, but is it enough? Many of the new NoSQL-type databases and distributed filesystems assume that replicating data to other nodes is good enough for durability. But if your entire data center loses power, and this does happen far too frequently, then data held only in the memory of those nodes is not persisted. For a few apps, replicating to remote data centers can be an answer, but, generally:
Replication is not Enough
So what is the cost of durability? For persistence to hard drives, it can get ugly. Any hard drive operation that actually goes to the media can be expected to take somewhere between 5 and 25 milliseconds. Adding that much overhead to every write transaction is pretty much a no-go these days, which is why it is rare to see real durability in systems today.
Solid state disks offer hope, but are not a panacea. Although the average write time to SSDs is low, the worst case can be quite high: I’ve seen write operations take as long as 6 seconds to complete on an SSD! This is due to the complexities of the Flash Translation Layer, and the need to sometimes wait for very expensive erase block cycles. But most SSDs now have write caches implemented in DRAM. Look at the SSD specifications and you’ll see that write times are often shorter than read times, around 15 microseconds. This means that the writes are just going to DRAM. But wait, DRAM isn’t durable! Which is why, if you care about durability, you must only use SSDs that have power-fail protection, i.e., enough internal power to dump their DRAM caches to flash in the event of a power loss. Sadly, though, there have been many SSD models that claim to have power-fail protection, yet have so many bugs that the protection cannot be trusted. Much diligence is required when sourcing SSDs to get the ones that really work.
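To put the 5 to 25 millisecond figure in perspective, here is a rough sketch in Python. The first function is plain arithmetic: if every commit waits for one synchronous media write and nothing is batched, the write latency caps the commit rate. The second function times `write` plus `os.fsync` on whatever device holds the temporary directory. Both function names are made up for this illustration, and the measured numbers will vary by machine. A suspiciously fast result may simply mean the write landed in a volatile cache, as discussed above.

```python
import os
import tempfile
import time


def max_serial_commits_per_second(write_latency_ms):
    """Upper bound on commits/s when each commit waits for one unbatched write."""
    return 1000.0 / write_latency_ms


def measure_fsync_latency_ms(samples=20, record=b"x" * 128):
    """Return (average, worst) time in ms for write + fsync of a small record."""
    timings = []
    with tempfile.NamedTemporaryFile() as f:
        for _ in range(samples):
            start = time.perf_counter()
            f.write(record)
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to push it to the device
            timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings), max(timings)


if __name__ == "__main__":
    for latency_ms in (5, 25):
        rate = max_serial_commits_per_second(latency_ms)
        print(f"{latency_ms} ms per write -> at most {rate:.0f} commits/s")
    average, worst = measure_fsync_latency_ms()
    print(f"measured fsync: average {average:.3f} ms, worst {worst:.3f} ms")
```

At 5 ms per write the ceiling is 200 commits per second, and at 25 ms it is 40, which is why unbatched synchronous logging to hard drives is so rarely tolerated.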
The other big technology which will affect durability is Intel’s 3D XPoint memory, which promises DRAM-like performance but with durable media. Intel thinks this will change the way applications are built, but I’m dubious. After all, durability demands replication, so you still have to get the data out of the box before it can be considered persistent. So network speeds will dominate the application view of durability, just as they do today.
Durability is limited by Network Speeds
Databases and file systems typically use a log, or journal, which is the information that must be persisted before acknowledging a transaction to a higher level. After acknowledgement, the system then continues to use the information, still in DRAM, to put data into the format and locations from which retrieval will take place. Only in the event of a crash or power failure does the log even get read. So the I/O pattern for a log is typically just a bunch of sequential writes, of low bandwidth, but which must have the lowest possible latency. Most of the NoSQL systems out there, from MongoDB to HBase, use a timeout-based journal/log mechanism that guarantees persistence within N milliseconds after a transaction is acknowledged, where N ranges from 50 to 10000. But as long as N is non-zero, a crash inside that window loses transactions that were already acknowledged, so these systems cannot meet the real definition of durability, and application data integrity can suffer as a result. So I propose a new definition of durability:
Durability is immediate, persistent, and replicated logging.
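The "immediate" and "persistent" parts of that definition can be sketched on a single node as a log that returns from `append` only after `os.fsync` has returned, so N is zero rather than 50 to 10000. This is a minimal illustration, not how any particular database does it: the `SyncLog` class, the `replay` function, and the length-plus-CRC record framing are all assumptions made for the example. It also inherits the caveat above that `fsync` is only as trustworthy as the device underneath it.

```python
import os
import struct
import zlib


class SyncLog:
    """Append-only log that acknowledges a record only after fsync returns."""

    HEADER = struct.Struct("<II")  # payload length, CRC32 of payload

    def __init__(self, path):
        self.path = path
        self._f = open(path, "ab")

    def append(self, payload):
        """Write one record; returning normally is the acknowledgement."""
        header = self.HEADER.pack(len(payload), zlib.crc32(payload))
        self._f.write(header + payload)
        self._f.flush()             # Python buffer -> OS
        os.fsync(self._f.fileno())  # OS -> device, before we acknowledge

    def close(self):
        self._f.close()


def replay(path):
    """Read records back after a crash, stopping at a torn or corrupt tail."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(SyncLog.HEADER.size)
            if len(header) < SyncLog.HEADER.size:
                break  # clean end of log, or a partially written header
            length, crc = SyncLog.HEADER.unpack(header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break  # record was cut off or damaged; it was never acknowledged
            records.append(payload)
    return records
```

As the article notes, the log is written sequentially on the normal path and read only during recovery, which is why `replay` is a separate function rather than part of the write path.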
The replication requirement for durability means that any system wishing to achieve durability becomes a distributed system. A wise man once told me that any distributed system takes 10 years to debug; any database takes 10 years to debug; and any distributed database is still being debugged! It's a hard problem, and one that should not be attempted casually.
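To make the replicated part concrete, here is a toy sketch of the acknowledgement rule: a write is reported as durable only once enough replicas have confirmed that they persisted it. Everything here is hypothetical and in-process. `Replica`, `replicated_append`, and the `ack_quorum` parameter are invented for illustration and are not the API of any real system. A real implementation would also have to deal with timeouts, retries, ordering, and recovery, which is exactly where the ten years of debugging go.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Stand-in for a remote node; a real one would write and fsync its own log."""
    name: str
    available: bool = True
    log: list = field(default_factory=list)

    def persist(self, record):
        """Return True only if the record was stored on this replica."""
        if not self.available:
            return False
        self.log.append(record)  # placeholder for write + fsync on that node
        return True


def replicated_append(replicas, record, ack_quorum):
    """Acknowledge the write only if at least ack_quorum replicas persisted it."""
    acks = sum(1 for replica in replicas if replica.persist(record))
    return acks >= ack_quorum


if __name__ == "__main__":
    nodes = [Replica("a"), Replica("b"), Replica("c", available=False)]
    print(replicated_append(nodes, b"txn-1", ack_quorum=2))  # two of three persisted
```

The design choice worth noting is that the caller gets a yes or no answer tied to persistence on several machines, rather than an acknowledgement that is followed by a best-effort flush some milliseconds later.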
When it comes to Consistency, another fundamental expectation from databases, the industry has learned to adopt one of a few projects for distributed consistency instead of trying to build ad-hoc ones. Apache ZooKeeper and CoreOS’ etcd are the enablers now for a huge number of distributed systems projects which would never have solved the consistency problem on their own.
So can we do for Durability what we’ve done for Consistency? Yes! I’ve recently learned of a project called Apache BookKeeper, which solves the replicated durability problem in a very elegant way. Already in production use in places like Twitter, Salesforce, and Yahoo, BookKeeper allows higher-level projects to focus on their core problems, instead of trying to debug one more implementation of a persistent, distributed log. BookKeeper provides the low-level storage and durability for projects like Apache DistributedLog and Yahoo’s Pulsar, which in turn provide higher-level services similar to the wildly popular Apache Kafka software. Unfortunately, Kafka itself does not meet my definition of durable, because it relies on replication without stable storage commits, i.e., a data center power failure could cause loss.
Need Durability? Use BookKeeper.