
Thinking Through A Mellanox Dual-Tier Fixed Switch 10G + 40G Fabric Design


When preparing for “The Ethernet Switching Landscape” presentation at Interop, I did quite a bit of reading through vendor marketing literature and documentation. One proposed fabric design by Mellanox stuck in my brain, because I was having a hard time mapping the numbers in their whitepaper to what the reality would look like. Here’s the slide I built around a section in the Mellanox whitepaper.

Click to BIGGIFY.

The point of this design was to demonstrate how to scale out a fabric supporting a large number of 10GbE access ports using fixed configuration Mellanox switches. The catch with fixed configuration switches is the obvious one: there are only so many ports available. Therefore, a design can scale sideways only so far before the switches run out of ports to connect the leaf and spine tiers together. This design idea adds a second leaf layer to expand the aggregate fabric size. Note that the idea is not unique to Mellanox – it just happens that I ran across a specific implementation of it in one of their whitepapers.
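To make the port-count constraint concrete, here’s a quick back-of-the-envelope sketch in Python. The switch port counts are my own assumptions (36 x 40GbE boxes for the spine and aggregation leaf tiers, 48 x 10GbE plus 4 x 40GbE access leaves), not numbers lifted from the whitepaper, so treat the totals as an illustration of why the second leaf tier matters rather than a reproduction of Mellanox’s figures.

```python
# Back-of-the-envelope fabric scale math. Port counts are my own assumptions
# chosen to resemble the switches under discussion, not whitepaper figures.

SPINE_PORTS = 36   # 40GbE ports per spine switch (SX1036-class box)
AGG_PORTS   = 36   # 40GbE ports per aggregation leaf, split half down / half up
ACCESS_DOWN = 48   # 10GbE server-facing ports per access leaf
ACCESS_UP   = 4    # 40GbE uplinks per access leaf

# Classic single leaf tier: every leaf burns one port on every spine, so the
# spine's port count caps the number of leaves, and therefore the access ports.
single_tier_ports = SPINE_PORTS * ACCESS_DOWN              # 36 * 48 = 1,728

# Dual leaf tier: aggregation leaves sit between access leaves and the spine.
# The spine still caps the aggregation tier at 36 switches, but each
# aggregation leaf now fans out to several access leaves below it.
agg_down        = AGG_PORTS // 2                           # 18 ports facing access leaves
agg_leaves      = SPINE_PORTS                              # 36 aggregation leaves max
access_leaves   = (agg_leaves * agg_down) // ACCESS_UP     # 648 / 4 = 162 access leaves
dual_tier_ports = access_leaves * ACCESS_DOWN              # 162 * 48 = 7,776

print(f"single leaf tier : {single_tier_ports:,} x 10GbE access ports")
print(f"dual leaf tier   : {dual_tier_ports:,} x 10GbE access ports")
```

This simplified model lands in the same neighborhood as the whitepaper’s claimed scale (the 6,912 ports mentioned in the observations below); the exact figure depends on how the real design splits the aggregation leaf ports and how many uplinks each access leaf uses.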

For whatever reason, I was having a hard time getting my brain to see how this would cable up, which ports were facing where, what the oversubscription would look like, etc. I think it was the way the whitepaper worded the description. I saw where they were headed, but I wasn’t quite getting the layout. So, I spent some time in Scapple to draw it all out. The result is the beastly large diagram below.

Dual-tier 40G Fabric Mellanox

4.9MB PNG, 6727 × 2014. Be patient downloading & viewing full-size. This image is really big.

I admit that it wasn’t strictly necessary to draw the entire proposed Mellanox design switch for switch, but I still found it to be a useful exercise. Seeing the entire layout in front of me got my brain spinning on some observations about large fabric design.

  1. The sheer number of cables required in this design is astonishing. I don’t need to do the math to make my point, though I’ve run a rough count of the cables and optics in a sketch after this list. Cabling is an expense not to be underestimated when designing large fabrics.
  2. Optic cost could well drive the choice of vendor in a design like this. When the number of optics required is as large as it is here, a swing in optic price between vendors of as little as 10-20% could result in a meaningful difference in capex.
  3. Fixed configuration switch designs do scale large, but they don’t scale infinitely. That said, the data center that would require more than 6,912 access ports of 10GbE to exist in the same fabric would be quite unusual. But my point is really this: the Mellanox SX-1036 used in the spine of this design has 36 40GbE ports. When you connect port 36, you’re done. You cannot connect additional leaf switches. The fabric in my diagram is as big as it gets. By contrast, if this fixed configuration SX-1036 were instead some sort of chassis switch, it would be possible to scale the spine more widely via additional line cards, assuming the leaf tier switch allowed it.
  4. The finite 40GbE spine switch raises the 100GbE question: at what point does it make sense to move to 100GbE for the leaf-spine tier interconnections? That’s a math problem in two ways. One is bandwidth requirements in the context of oversubscription tolerance. The other is the cost of 100GbE ports. (The oversubscription sketch after this list works a simple example.) I know Arista really wants folks to ask this exact question and then shop their 7500 chassis with 7500E line cards.
  5. Replicating fabrics of this size or smaller within a data center and then interconnecting those fabrics seems interesting. However, interconnecting multiple fabric islands implies a loss of freedom at the higher layers of the data center. Part of the point of a fabric is to allow any-to-any communication with a predictable hop count and predictable forwarding characteristics, no matter what is speaking to what. As soon as a bridge connecting fabric islands is introduced, the freedom for application architects to place compute loads anywhere they like is curtailed, as the bridge represents an out-of-fabric path, i.e. one that no longer has predictable forwarding characteristics. Ergo, generally speaking, a single fabric that can scale to the needs of the entire compute infrastructure is desirable.
  6. Mellanox interconnects the access tier leaf switches to a single aggregation tier leaf switch. I pondered this for a while, because the normal design for smaller enterprise class data centers is to dual-home virtually everything. “Triangles not squares,” is a design mantra that I think I picked up from Cisco literature. Seeing the Mellanox plumbing made me rethink where the redundancy would come from. I came up with a couple of answers. One is that the application itself would be redundant. In other words, applications running as clusters spread across multiple physical hosts are less concerned with whether or not the underlying physical network is robust. If some compute element in an application cluster goes away due to a network failure, the application soldiers on, albeit a little more slowly. The other is that the physical hosts themselves could attach to multiple access-layer switches (although Mellanox didn’t happen to draw the design this way). Making that work comes down to the physical placement of ToR switches in racks. Indeed, I have seen exactly this design in the past, where two ToR switches racked one on top of the other represent redundant network paths, and physical hosts are plumbed to both ToRs in a way that is redundant for the underlying operating system.
  7. In the context of the point above, my understanding is that certain compute loads demand that the network transport characteristics be identical at all parts of the data center. In other words, a compute node that lost half of its bandwidth due to a network component failure might well serve the wider application better by simply dropping off the grid. Running at half speed could be detrimental to aggregate application performance, as transactions sent to that particular cluster member would not complete in the amount of time expected by those who tuned the application. That’s counter-intuitive to those of us with experience building network platforms for enterprise applications, but it’s an interesting point I’ve heard brought up in the context of HPC. Dual-homing everything isn’t always the right answer.
  8. This design creates a 3:1 oversubscription between the access leaf switch and the aggregation leaf switch (the ratio is worked through in the oversubscription sketch after this list). 3:1 is a general rule of thumb in large-scale fabric design that I’ve heard a number of times now. Designing for a smaller access-layer oversubscription ratio (for example, 2:1) drives up the overall cost of the fabric. While a totally non-blocking fabric might sound like nirvana, there is an easy-to-quantify price tag associated with it in the form of additional switches, cables, and optics, as well as RUs and power consumed along with BTUs generated. The reality is that most applications don’t require true non-blocking all the way through the fabric. Spine layer? Yes, in order to guarantee predictable forwarding behavior. Leaf layer? No, as it’s assumed that applications are bursty and do not generate line-rate data streams 100% of the time.
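Item 1 doesn’t need the math to make its point, but for the curious, here’s a rough count of fabric cables and optics. The switch counts are my own reconstruction, sized to be consistent with the port counts discussed above (144 access leaves of 48 x 10GbE each gets to 6,912 ports, and 576 access uplinks landing on aggregation leaves with 18 downstream ports apiece implies 32 of them); the optic prices are hypothetical placeholders, not vendor quotes.

```python
# Rough cable and optic count for a fabric of the general shape discussed above.
# Switch counts are my reconstruction of the design; prices are hypothetical.

ACCESS_LEAVES  = 144   # 144 switches * 48 ports = 6,912 x 10GbE access ports
ACCESS_UPLINKS = 4     # 40GbE uplinks per access leaf
AGG_LEAVES     = 32    # 576 access uplinks / 18 downstream ports per agg leaf
AGG_UPLINKS    = 18    # 40GbE uplinks per aggregation leaf toward the spine

fabric_links = ACCESS_LEAVES * ACCESS_UPLINKS + AGG_LEAVES * AGG_UPLINKS
optics       = fabric_links * 2      # a transceiver at each end of every fabric link

print(f"fabric links : {fabric_links:,}")   # 1,152 cables before a single server is attached
print(f"transceivers : {optics:,}")         # 2,304 optics

# A 10-20% per-optic price gap between vendors adds up quickly at this scale.
for price in (300, 500):                    # hypothetical 40GbE optic prices, USD
    base = optics * price
    print(f"at ${price}/optic: ${base:,} total, "
          f"10% swing = ${round(base * 0.10):,}, 20% swing = ${round(base * 0.20):,}")
```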
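The 3:1 figure in item 8 and the 100GbE question in item 4 both come down to a simple ratio of edge bandwidth to uplink bandwidth. The sketch below assumes a 48 x 10GbE / 4 x 40GbE access leaf to match the 3:1 discussion; the 100GbE variant is a purely hypothetical what-if, not something the whitepaper proposes.

```python
# Oversubscription at a leaf switch: server-facing bandwidth vs. uplink bandwidth.
# Port mixes below are assumptions for illustration, not whitepaper specifics.

def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Ratio of downstream (server-facing) bandwidth to upstream (fabric) bandwidth."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

print(oversubscription(48, 10, 4, 40))    # 3.0 -> 480G of edge over 160G of uplink, i.e. 3:1
print(oversubscription(48, 10, 2, 100))   # 2.4 -> same edge with 2 x 100GbE uplinks instead
print(oversubscription(48, 10, 12, 40))   # 1.0 -> non-blocking: 480G of uplink matches the edge
```

Whether the lower ratio that 100GbE uplinks could buy is worth the per-port premium is exactly the cost-versus-oversubscription-tolerance trade-off item 4 describes.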

Links

There are several fabulous articles out there on large-scale fabric design. Reading up on Clos theory is also useful in this context. Here are a few articles you might like to view.

Cheap Network Equipment Makes a Better Data Centre
Greg Ferro, etherealmind.com

Construct a Leaf Spine design with 40G or 10G? An observation in scaling the fabric.
Brad Hedlund, bradhedlund.com

Full mesh is the worst possible fabric architecture
Ivan Pepelnjak, blog.ipspace.net

Clos network
Wikipedia