The Principle of Same-Same in Physical Network Design

837 Words. Plan about 5 minute(s) to read this.

In modern network architecture, most designs are redundant, often all the way through. Hosts uplink to two different ToR switches. Those ToR switches usually have two uplinks to a distribution layer, or potentially more uplinks in leaf-spine designs. Spine switches uplink to a pair of core switches. Physical firewalls are deployed as clusters. Multiple connections are made from an organization to the Internet. Application delivery controllers are used as the connection point for clients, abstracting away the multiple real servers that sit behind them.

All of this redundancy has one chief benefit — the elimination of single points of failure (SPOF). The term “SPOF” indicates a failure that, by itself, would cause a service interruption. When a SPOF is eliminated, that means that the IT engine is tolerant of that one component failing. Users should be able to continue working with minimal impact, despite the failure. Storage RAID is an example of this, where a single disk (and sometimes multiple disks) can fail in an array with no loss of data or interruption in service.
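To make the SPOF definition concrete, here is a minimal sketch, assuming the topology is modeled as an undirected graph of device names (all names below are hypothetical). A device is a SPOF in this model exactly when removing it disconnects the graph, which graph theory calls an articulation point. The sketch only considers whole-device failures, not individual link or optic failures.

```python
from collections import defaultdict


def articulation_points(links):
    """Return the devices whose failure alone would split the topology (Tarjan's algorithm)."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)

    disc = {}  # order in which each device was first visited by the DFS
    low = {}   # lowest visit order reachable from this device's DFS subtree via one back edge
    points = set()
    counter = [0]

    def dfs(node, parent):
        disc[node] = low[node] = counter[0]
        counter[0] += 1
        children = 0
        for nxt in graph[node]:
            if nxt == parent:
                continue
            if nxt in disc:
                # Back edge: an alternate path to an already-visited device.
                low[node] = min(low[node], disc[nxt])
            else:
                children += 1
                dfs(nxt, node)
                low[node] = min(low[node], low[nxt])
                # Nothing below nxt can reach above node without going through node.
                if parent is not None and low[nxt] >= disc[node]:
                    points.add(node)
        # The DFS root is a cut point only if it has more than one DFS child.
        if parent is None and children > 1:
            points.add(node)

    for node in list(graph):
        if node not in disc:
            dfs(node, None)
    return points


# Hypothetical fully redundant path from a host to the Internet.
redundant = [
    ("host", "tor1"), ("host", "tor2"),
    ("tor1", "spine1"), ("tor1", "spine2"),
    ("tor2", "spine1"), ("tor2", "spine2"),
    ("spine1", "core1"), ("spine1", "core2"),
    ("spine2", "core1"), ("spine2", "core2"),
    ("core1", "internet"), ("core2", "internet"),
]
# Same design, but the host is single-homed to tor1.
single_homed = [link for link in redundant if link != ("host", "tor2")]

print(sorted(articulation_points(redundant)))     # []
print(sorted(articulation_points(single_homed)))  # ['tor1']
```

Dropping a single host uplink is enough to turn tor1 into a SPOF for that host, even though everything above it is still redundant.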


An interruption in airplane service due to a system failure would really bother me, mostly because of gravity. Therefore, I choose to believe commercial jets are highly redundant.

All of this redundancy in IT has the downside of adding complexity and cost to the overall design. Cost is what it is. An organization is willing and able to absorb the expense of redundancy, or it isn’t. For IT practitioners, the larger problem is complexity. The hardware and software required to make a system redundant bring their own headaches and can, ironically, introduce fragility into the very system they are meant to make robust. (See David Meyer’s talk on this topic if it piques your interest.)

One of the ways to reduce the complexity in redundant networking schemes is what I call the principle of same-same.

Simply put, “same-same” means that what you do in one place, you match exactly in the other.

The best way for me to explain this is by way of example.

  1. When uplinking a host to two access-layer switches, use the same port number on both switches. If you plug the host’s first NIC into port 12 of ToR switch 1, plug the host’s second NIC into port 12 of ToR switch 2. In addition, port 12 on the ToR switches should have identical configurations.
  2. When configuring redundant switches, configure them identically. Both switches should have the same QoS, management, and routing configurations, consistent port descriptions, matching access-lists, etc.
  3. When specifying redundant core switches, they should be identical hardware. If they are fixed configuration switches, they should be the same model, optioned identically. If they are chassis switches, they should have the same supervisor engines and line cards.
  4. When building a redundant network at a different data center, either for disaster recovery or for “active/active” application designs, the networks as a whole should be identical. Even if only used for DR, a second data center network needs to be counted on to behave identically to the first data center.
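The first two rules lend themselves to an automated check. What follows is a minimal sketch, not a tool from this post: the cabling record (a dict mapping each host to its port on each ToR switch), the port names, and the choice to ignore only `hostname` lines are all hypothetical assumptions. Real configs will have other expected per-device differences, such as management addresses, that you would add to the ignore list.

```python
import difflib


def mismatched_uplinks(cabling):
    """Hosts whose two NICs land on different port numbers on the paired ToR switches."""
    return sorted(host for host, ports in cabling.items() if len(set(ports.values())) > 1)


def config_drift(config_a, config_b, ignore_prefixes=("hostname",)):
    """Unified diff of two switch configs; an empty list means the pair is same-same.

    Lines starting with ignore_prefixes (expected per-device differences) are skipped.
    """
    def normalize(config):
        return [line.rstrip() for line in config.splitlines()
                if line.strip() and not line.strip().startswith(ignore_prefixes)]

    return list(difflib.unified_diff(normalize(config_a), normalize(config_b),
                                     fromfile="switch-a", tofile="switch-b", lineterm=""))


# Hypothetical cabling record: host -> {ToR switch: port}.
cabling = {
    "web01": {"tor1": "Ethernet1/12", "tor2": "Ethernet1/12"},
    "db01": {"tor1": "Ethernet1/14", "tor2": "Ethernet1/15"},
}

tor1_cfg = """hostname tor1
interface Ethernet1/12
 description web01
 ip access-group SERVERS in
"""
# tor2 has drifted: the access-list was never applied to its port 12.
tor2_cfg = """hostname tor2
interface Ethernet1/12
 description web01
"""

print(mismatched_uplinks(cabling))  # ['db01']
for line in config_drift(tor1_cfg, tor2_cfg):
    print(line)
```

Run against every redundant pair, an empty result from both functions is a cheap signal that the pair is still same-same.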

This approach to building out redundant physical networks helps to reduce overall system complexity.

First, troubleshooting becomes easier. For instance, an engineer who can count on a server being plugged into the same port on both ToR switches, with both switches containing the same access-lists, etc., will have an easier time diagnosing a problem.

Second, performance is predictable. Mismatched network equipment means that performance of applications can vary depending on what path is taken through the infrastructure. When equipment is identical, there should never be a performance problem due to path, except in the case of a partial failure such as an optic going bad.

Third, and related to the second point, capacity planning is easier. In a dual design, no single path should ever exceed 50% utilization. The idea is that if one path fails, the redundant path must carry the entire load. If the redundant path performs worse, either in raw speed or in packets-per-second forwarding capability, its ability to handle that full load is compromised. Mismatched equipment makes capacity planning more difficult.
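A quick way to see why mismatched gear undermines the 50% rule is to check failover directly. This is a minimal sketch that assumes each path is described only by its current load and its raw capacity in Gbps (hypothetical figures); it ignores packets-per-second limits, which would need the same treatment.

```python
def survives_single_failure(paths):
    """For each path, report whether the other paths could absorb the total load if it failed.

    paths maps a path name to (current_load_gbps, capacity_gbps).
    """
    total_load = sum(load for load, _ in paths.values())
    return {
        failed: total_load <= sum(cap for name, (_, cap) in paths.items() if name != failed)
        for failed in paths
    }


# Both paths at 45% utilization on identical 40 Gbps gear.
matched = {"path-a": (18, 40), "path-b": (18, 40)}
# Both paths still under 50% (45% and 40%), but path-b is 10 Gbps gear.
mismatched = {"path-a": (18, 40), "path-b": (4, 10)}

print(survives_single_failure(matched))     # {'path-a': True, 'path-b': True}
print(survives_single_failure(mismatched))  # {'path-a': False, 'path-b': True}
```

With identical gear, keeping each path under 50% guarantees the survivor can take the full load. With mismatched gear, both paths can sit under 50% and a failure of the bigger one still overruns the smaller one (22 Gbps of traffic onto a 10 Gbps path).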

In summary, building network infrastructure “same-same” in multiple locations is one way to ensure that applications will perform consistently, no matter what pipes they are flowing through. In addition, an element of randomness is reduced in the system when redundant networks match identically. Reducing randomness reduces complexity – and that’s a good thing.

Redundancy is never the time to think, “I can put this old piece of hardware in place, because it’s just a backup.” Redundancy is NOT simply a backup. Rather, redundancy is the only thing keeping an organization moving forward in the case of an inevitable failure. Don’t think of redundancy as the spare tire “donut” in the trunk, where you can keep going as long as you travel slowly enough. Rather, redundancy is the full-size spare that’s required for your applications to continue running at normal speed.