Thinking Through A Mellanox Dual-Tier Fixed Switch 10G + 40G Fabric Design


When preparing for “The Ethernet Switching Landscape” presentation at Interop, I did quite a bit of reading through vendor marketing literature and documentation. One proposed fabric design by Mellanox stuck in my brain, because I was having a hard time mapping the numbers in their whitepaper to what the reality would look like. Here’s the slide I built around a section in the Mellanox whitepaper.

Click to BIGGIFY.

The point of this design was to demonstrate how to scale out a fabric supporting a large number of 10GbE access ports using fixed configuration Mellanox switches. The catch with fixed configuration switches is the obvious one: there are only so many ports. Therefore, a design can scale sideways only so far before the switches run out of ports to connect the leaf and spine tiers together. This design idea adds a second leaf layer to expand the aggregate fabric size. Note that this design idea is neither uncommon nor unique to Mellanox – it just happens that I ran across a specific implementation of it in one of their whitepapers.
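
To make that scaling limit concrete, here's a quick back-of-the-envelope calculation. The port counts are my own illustrative assumptions (a 36-port 40GbE spine box and an access leaf with 48 x 10GbE edge ports), not a restatement of the whitepaper's exact parts list, so treat this as a sketch of the arithmetic rather than the Mellanox design itself.

    # Back-of-the-envelope scaling math for leaf-spine fabrics built from
    # fixed configuration switches. All port counts are illustrative
    # assumptions, not the exact Mellanox whitepaper bill of materials.

    SPINE_PORTS = 36    # 40GbE ports per spine switch (a 36-port box)
    EDGE_PER_LEAF = 48  # 10GbE server-facing ports per access leaf (assumed)

    # Two tiers: each access leaf uplinks to every spine, so the spine's
    # port count caps how many leaves the fabric can hold.
    two_tier_max_edge = SPINE_PORTS * EDGE_PER_LEAF
    print(f"two tiers:      {two_tier_max_edge} x 10GbE edge ports")  # 1728

    # Add an aggregation leaf tier: each spine port now feeds an aggregation
    # leaf instead of an access leaf, and each aggregation leaf fans out to
    # several access leaves. How many it can front depends on how its ports
    # are split between spine-facing and access-facing links; 4 is assumed.
    ACCESS_LEAVES_PER_AGG = 4
    dual_leaf_max_edge = SPINE_PORTS * ACCESS_LEAVES_PER_AGG * EDGE_PER_LEAF
    print(f"dual leaf tier: {dual_leaf_max_edge} x 10GbE edge ports")  # 6912

Under these assumptions the second leaf tier happens to land on the same 6,912-port ceiling the whitepaper design tops out at, though the real port split may differ. The point is simply that the extra tier multiplies the ceiling rather than removing it.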

For whatever reason, I was having a hard time getting my brain to see how this would cable up, which ports were facing where, what the oversubscription would look like, etc. I think it was the way the whitepaper worded the description. I saw where they were headed, but I wasn’t quite getting the layout. So, I spent some time in Scapple to draw it all out. The result is the beastly large diagram below.

Dual-tier 40G Fabric Mellanox
4.9MB PNG, 6727 × 2014. Be patient downloading & viewing full-size. This image is really big.

I admit that it wasn’t strictly necessary to draw the entire proposed Mellanox design switch by switch, but I still found it to be a useful exercise. Seeing the entire layout in front of me got my brain spinning on some observations about large fabric design.

  1. The sheer number of cables required in this design is astonishing. I don’t need to do the math to make the point, but the cabling sketch after this list puts a rough number on it. Cabling is an expense not to be underestimated when designing large fabrics.
  2. Optic cost could well drive the choice of vendor in a design like this. When the number of optics required is as many as you see here, a swing in optic price between vendors by as little as 10-20% could result in a meaningful difference in capex.
  3. Fixed configuration switch designs do scale large, but they don’t scale infinitely. That said, a data center requiring more than 6,912 access ports of 10GbE in the same fabric would be quite unusual. But my point is really this: the Mellanox SX-1036 used in the spine of this design has 36 40GbE ports. When you connect port 36, you’re done. You cannot connect additional leaf switches. The fabric in my diagram is as big as it gets. By contrast, if this fixed configuration SX-1036 were instead some sort of chassis switch, it would be possible to scale the spine more widely via additional line cards, assuming the leaf tier switch allowed it.
  4. The finite 40GbE spine switch raises the 100GbE question. At what point does it make sense to move to 100GbE for leaf-spine tier interconnections? That’s a math problem in two ways. One is bandwidth requirements in the context of oversubscription tolerance (the oversubscription sketch after this list works a small example). The other is the cost of 100GbE ports. I know Arista really wants folks to ask this exact question and then shop their 7500 chassis with 7500E line cards.
  5. Replicating fabrics of this size or smaller within a data center and then interconnecting them seems interesting. However, interconnecting multiple fabric islands implies a loss of freedom at the higher layers of the data center. Part of the point of a fabric is to allow any-to-any communication with predictable hop count and forwarding characteristics no matter what is speaking to what. As soon as a bridge connecting fabric islands is introduced, the freedom for application architects to place compute loads anywhere they like is curtailed, because the bridge represents an out-of-fabric path, i.e. one that no longer has predictable forwarding characteristics. Ergo, generally speaking, a single fabric that can scale to the needs of the entire compute infrastructure is desirable.
  6. Mellanox interconnects the access tier leaf switches to a single aggregation tier leaf switch. I pondered this for a while, because the normal design for smaller enterprise-class data centers is to dual-home virtually everything. “Triangles, not squares” is a design mantra I think I picked up from Cisco literature. Seeing the Mellanox plumbing made me rethink where the redundancy would come from. I came up with a couple of answers. One is that the application itself would be redundant. In other words, applications running as clusters spread across multiple physical hosts are less concerned with whether or not the underlying physical network is robust. If some compute element in an application cluster goes away due to a network failure, the application soldiers on, albeit a little more slowly. The other is that the physical hosts themselves could attach to multiple access-layer switches (although Mellanox didn’t happen to draw the design this way). Facilitating that comes down to the physical placement of ToR switches in the racks. Indeed, I have seen exactly this design in the past, where two ToR switches racked one on top of the other represent redundant network paths. Physical hosts are plumbed to both ToRs in a way that is redundant for the underlying operating system.
  7. In the context of the point above, my understanding is that certain compute loads demand that network transport characteristics be identical at all parts of the data center. In other words, a compute node that lost half of its bandwidth due to a network component failure might well serve the wider application better by simply dropping off the grid. Running at half speed could be detrimental to aggregate application performance, as transactions sent to that particular cluster member would not complete in the time expected by those who tuned the application. That’s counterintuitive to those of us with experience building network platforms for enterprise applications, but it’s an interesting point I’ve heard brought up in the context of HPC. Dual-homing everything isn’t always the right answer.
  8. This design creates a 3:1 oversubscription between the access leaf switch and the aggregation leaf switch. 3:1 is a general rule of thumb in large-scale fabric design that I’ve heard a number of times now; the oversubscription sketch after this list shows how a ratio like that falls out of typical port counts. Designing for a smaller access-layer oversubscription ratio (for example, 2:1) drives up the overall cost of the fabric. While a totally non-blocking fabric might sound like nirvana, there is an easy-to-quantify price tag associated with it in the form of additional switches, cables, and optics, as well as rack units occupied, power consumed, and BTUs generated. The reality is that most applications don’t require true non-blocking all the way through the fabric. Spine layer? Yes, in order to guarantee predictable forwarding behavior. Leaf layer? No, as it’s assumed that applications are bursty and do not generate line-rate data streams 100% of the time.
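
On the cabling and optics points (1 and 2), here’s a rough counting sketch. The switch and link counts are my own assumptions at roughly the scale of the diagram, and the optic price is a pure placeholder rather than a quote, so read the output as an order-of-magnitude feel, not a bill of materials.

    # Rough cable and optic counting for a dual leaf tier fabric. Counts
    # and prices are illustrative assumptions, not the whitepaper's exact
    # bill of materials or anyone's quote.

    ACCESS_LEAVES = 144         # assumed
    UPLINKS_PER_ACCESS = 4      # 40GbE links per access leaf toward aggregation (assumed)
    AGG_LEAVES = 36             # assumed
    UPLINKS_PER_AGG = 20        # 40GbE links per aggregation leaf toward spine (assumed)
    EDGE_PORTS_PER_ACCESS = 48  # 10GbE server-facing ports (assumed)

    fabric_links = ACCESS_LEAVES * UPLINKS_PER_ACCESS + AGG_LEAVES * UPLINKS_PER_AGG
    edge_links = ACCESS_LEAVES * EDGE_PORTS_PER_ACCESS
    fabric_optics = fabric_links * 2  # one optic at each end, ignoring passive DAC runs

    print(f"fabric links:  {fabric_links}")   # 1296
    print(f"fabric optics: {fabric_optics}")  # 2592
    print(f"edge links:    {edge_links}")     # 6912

    # A 10-20% per-optic price difference between vendors, multiplied
    # across thousands of optics, is real money. Unit price is a placeholder.
    PLACEHOLDER_OPTIC_PRICE = 400.0  # assumed US$ per 40GbE optic
    for swing in (0.10, 0.20):
        delta = fabric_optics * PLACEHOLDER_OPTIC_PRICE * swing
        print(f"{int(swing * 100)}% optic price swing ~ ${delta:,.0f}")

Even with conservative assumptions, well over a thousand fabric cables and a couple of thousand optics add up quickly, and that’s before counting the server-facing cabling at all.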
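
On the oversubscription point (8), plus the bandwidth half of the 100GbE question (4), the arithmetic is just edge-facing bandwidth divided by fabric-facing bandwidth. The port counts below are assumed for illustration; they show one way a 3:1 ratio falls out of common fixed-switch port counts, and what swapping the uplinks to 100GbE does to the ratio.

    # Oversubscription ratio = edge-facing bandwidth / fabric-facing bandwidth.
    # Port counts are assumed for illustration.

    def oversubscription(edge_ports: int, edge_gbps: int,
                         uplinks: int, uplink_gbps: int) -> float:
        """Ratio of downstream (server-facing) to upstream (fabric-facing) bandwidth."""
        return (edge_ports * edge_gbps) / (uplinks * uplink_gbps)

    # An access leaf with 48 x 10GbE edge ports and 4 x 40GbE uplinks:
    print(oversubscription(48, 10, 4, 40))   # 3.0 -> the familiar 3:1

    # Same edge, 12 x 40GbE uplinks: non-blocking at the leaf, at the cost
    # of three times the uplink ports, cables, and optics.
    print(oversubscription(48, 10, 12, 40))  # 1.0

    # Swap the uplinks for 2 x 100GbE instead of 4 x 40GbE: a bit more
    # uplink bandwidth from half as many ports and cables.
    print(oversubscription(48, 10, 2, 100))  # 2.4

Whether the 100GbE ports and optics are priced well enough to make that trade worthwhile is the other half of the math, and that changes with every product cycle.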

Links

There are several fabulous articles out there on large-scale fabric design. Reading up on Clos network theory is also useful in this context. Here are a few articles you might like to view.

Cheap Network Equipment Makes a Better Data Centre
Greg Ferro, etherealmind.com

Construct a Leaf Spine design with 40G or 10G? An observation in scaling the fabric.
Brad Hedlund, bradhedlund.com

Full mesh is the worst possible fabric architecture
Ivan Pepelnjak, blog.ipspace.net

Clos network
Wikipedia




Comments

  1. nevynxxx

    “Conversely, if this fixed configuration SX-1036 were instead some sort of chassis switch, it would be possible to scale the spine more widely via additional line cards”

    Since you are calculating the theoretical maximums, this doesn’t make any sense to me… Once you populate the last port in the last line card in that chassis you have exactly the same problem ;) Granted a chassis will usually have a lot more ports before that happens…

    1. Ethan Banks (post author)

      Yes, it’s entirely fair to point out that this is an exercise in finding the theoretical maximums, but I’m really just diverging for a moment to make the point that fixed config switches do have a hard limit. (An obvious point, I know.) This design Mellanox laid out is at its maximum capacity. This fabric can’t get bigger – the spine tier (and aggregation leaf tier) are maxed out. You can’t add more spine switches to gain further scale – the design is maxed out with 36 leaf switches in the next tier up. The big deal is that there’s no room for growth. But…what if the spine and aggregation leaf tier were modular switches with higher 40GbE density? Say the Cisco Nexus 6004? Or an Arista 7500E chassis with their 36-port 40GbE line card? Then you can grow the fabric incrementally over time by adding additional ports into the chassis, assuming ECMP capabilities scale as well. Hope that makes sense.

      1. nevynxxx

        Yeah, that makes a bit more sense. I guess a lot will depend on how you are placing the physical boxes too as you scale.

  2. Wes Felter (@wmf)

    You can always go bigger by adding another layer.

    I find this design a little perplexing and I wonder if it’s a good idea to spend much time worrying about it. For example, why do all 12 uplinks from the TOR go to a single switch? They should go to different switches for path diversity.

  3. Ross Alexander

    Assuming the leaf-spine is running some sort of fabric protocol capable of ECMP, I think most designs are much of a muchness. The requirement that all possible paths between any two hosts have exactly the same cost forces a very symmetric topology.

    However, I was very interested in the host-to-fabric connection. Ignoring for the moment HPC designs where losing connectivity to a compute host is tolerable, I’m making the assumption that some sort of virtual switch is running on each host with a number of virtual clients behind it. You then need to connect each host to two different switches, either ToR or spine switches. If you want to run active/active, do you use naive virtual source port ID hashing, MLAG, or have the vSwitch as a full member of the fabric?

    So can TRILL/SPB/IS-IS/OSPF scale to having hundreds of members, and would the size and (computational) cost of the fabric outweigh the traffic throughput advantages?

    In the limit, to get the most out of ECMP fabrics, do you need a highly symmetric distribution of compute/storage/north-south gateways combined with a highly entropic traffic pattern to maximize hashing efficiency?

    As an aside, I’ve been playing around with Graphviz and Mininet to look at various topologies and how they would behave as an OpenFlow fabric, but I can’t figure out the best way to model host connections yet.

