The foundation of any architecture for disaggregation and composition is the choice of the fabric with which to connect the disaggregated components. Today we look at what it takes to use Ethernet as the fabric for composable infrastructure.
But what is a fabric? Many vendors view a fabric as a highly controlled, small-scale, single-vendor interconnect for some larger system. We’ll refer to these as “Embedded” Ethernet products. The vendors of these products get the benefits of using Ethernet, but to the users, the fabric is a black box.
At DriveScale, we reject the notion of Embedded Ethernets. Our products let you use the very same network you already have on your servers to compose external storage resources into those servers. Let’s call this the “General” Ethernet scenario. Of course, a user may choose to deploy separate networks or fabrics to meet their own availability or performance needs, but that is not a requirement of our product. We do have bandwidth requirements – at least 10GbE on each server for spinning disks and 25GbE for NVMe flash. But most servers today ship with 2x25GbE ports, and 100GbE is increasingly common.
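Those bandwidth floors follow from simple arithmetic. As a sketch (the per-drive throughput figures below are illustrative assumptions, not DriveScale specifications):

```python
# Back-of-envelope check of the bandwidth floors above.
# Assumed throughput figures (illustrative):
#   ~100 MB/s sustained per spinning disk, ~3 GB/s per NVMe SSD.

def gbps(bytes_per_sec: float) -> float:
    """Convert bytes per second to gigabits per second."""
    return bytes_per_sec * 8 / 1e9

hdd_shelf = 12 * 100e6   # a dozen spinning disks composed into one server
nvme_drive = 3e9         # a single NVMe flash drive

print(f"12 HDDs: {gbps(hdd_shelf):.1f} Gb/s")   # ~9.6 Gb/s -> fills a 10GbE link
print(f"1 NVMe:  {gbps(nvme_drive):.1f} Gb/s")  # ~24.0 Gb/s -> wants 25GbE
```

Even a modest shelf of spinning disks saturates a 10GbE link, and a single fast NVMe drive exceeds it, which is why the floors differ.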
Supporting General Ethernet as a fabric is no walk in the park. It has been almost 40 years since the original DEC/Intel/Xerox Ethernet standard was published. With 40 years of evolution, adoption, use and misuse, datacenter Ethernets can be horribly messy things. There’s no such thing as a typical Ethernet – there are major sects and factions, constant evolution, and an incredible proliferation of capabilities. But the adoption of Ethernet as the “one true network” remains a key determinant of data center success. Do ANY of the hyperscalers use anything besides Ethernet? Has there been any new Fibre Channel customer in the past 10 years? Are there any major systems vendors adopting PCI Express as an external fabric? No, no, and no.
The DriveScale solution works hard to get the most out of any Ethernet. We have topology detection to discover bandwidth domains, aggressive use of multi-pathing, and patented load-balancing techniques to get the most throughput for heavy data flows. But none of this is inflicted on the user – it just happens transparently.
Let’s take a look at the fundamentals of Ethernet performance – bandwidth, latency, and CPU occupancy/overhead.
Ethernet switch chips are the highest-bandwidth devices in the world. There are vendors offering chips with 128 or 256 ports of 100 Gb/s each! In mainstream datacenter racks, 100GbE is readily available to the server, and often the highest-cost components are the cables, not the electronics. In the backbone/optical domain, 400GbE is already being deployed. So bandwidth is not a problem – unless you’re stuck on some 20-year-old switch deployment.
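For a sense of scale, the aggregate capacity implied by those port counts is easy to compute (a quick sketch using the port counts quoted above):

```python
# One-way aggregate switching capacity: ports x per-port rate.

def aggregate_tbps(ports: int, gbe_per_port: int) -> float:
    """Total one-direction capacity in terabits per second."""
    return ports * gbe_per_port / 1000

print(aggregate_tbps(128, 100))  # 12.8 Tb/s from a single switch chip
print(aggregate_tbps(256, 100))  # 25.6 Tb/s
```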
Latency is where Ethernet has a terrible reputation. But like much “common knowledge”, this one is unfounded. Ethernet earned the reputation when Cisco Catalyst 6500-class switches, used for 100Mb and 1Gb Ethernet, were the dominant datacenter platform. Those particular switches could have latencies of hundreds of microseconds, but that was never true of all Ethernet switches. Modern datacenter Ethernet switches have port-to-port latencies under a microsecond.
CPU occupancy does tend to be high with Ethernet because of the use of the TCP protocol. Most TCP offload schemes fail because TCP is deeply intertwined with 40 years of operating system and application deployment. But the vast majority of datacenter servers run with plenty of excess CPU capacity – and the network is busy when the application does I/O, not when it is computing. So CPU overhead is rarely a concern. Ethernet can now support RDMA through RDMA over Converged Ethernet (“RoCE”), which solves the CPU occupancy problem. Yet RDMA can be very difficult to deploy in large heterogeneous environments and works best in an “Embedded” fabric.
So let’s do a composability scorecard. We compare General Ethernet, Embedded Ethernet, and, as a baseline, internal DAS (direct-attached storage). DAS is prevalent in scale-out servers today.
- Embedded fabrics are, by definition, not compatible with existing infrastructure
- DAS is usually massively over-provisioned, and always un-poolable, resulting in much greater cost
- DAS is captive to a server. Server downtime means data downtime.
- With embedded fabrics, security tools such as firewall rules and VLANs typically cannot be used.
The NVMe over Fabrics protocol standard is a huge boon to disaggregation and composability. It provides a way to use network-attached NVMe drives with little or no performance degradation. But NVMe-oF is new, and RDMA is not widely deployed or understood, so alternate transports such as NVMe over TCP (coming soon) and good old iSCSI can be the most practical way to deliver composable SSDs.
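For concreteness, here is roughly what attaching a network SSD looks like with the standard open-source Linux clients (nvme-cli and open-iscsi), for both an NVMe/TCP target and an iSCSI target. The address, NQN, and IQN below are placeholders; this is a generic sketch, not DriveScale’s tooling.

```shell
# Discover and connect an NVMe over TCP target (nvme-cli).
# 192.0.2.10 and the NQN are placeholder values.
nvme discover -t tcp -a 192.0.2.10 -s 4420
nvme connect -t tcp -a 192.0.2.10 -s 4420 -n nqn.2018-01.com.example:ssd1

# The same kind of drive exported via good old iSCSI (open-iscsi).
iscsiadm -m discovery -t sendtargets -p 192.0.2.10
iscsiadm -m node -T iqn.2018-01.com.example:ssd1 -p 192.0.2.10 --login
```

Either way, the composed drive shows up as an ordinary local block device, which is what keeps network-attached storage transparent to applications.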
Stay tuned as we next create our Scorecard for Ethernet as a Fabric.