The Principle of Same-Same in Physical Network Design


In modern network architecture, most designs are redundant, often all the way through. Hosts uplink to two different ToR switches. Those ToR switches usually have two uplinks to a distribution layer, or potentially more in leaf-spine designs. Spine switches uplink to a pair of core switches. Physical firewalls are deployed as clusters. Organizations provision multiple connections to the Internet. Application delivery controllers serve as the connection point for clients, abstracting away the pool of real servers that sits behind them.

All of this redundancy has one chief benefit: the elimination of single points of failure (SPOF). A SPOF is a single component whose failure, by itself, would cause a service interruption. Eliminating a SPOF means the IT engine can tolerate that one component failing; users should be able to continue working with minimal impact, despite the failure. Storage RAID is an example, where a single disk (and sometimes multiple disks) can fail in an array with no loss of data or interruption in service.

[Image caption: An interruption in airplane service due to a system failure would really bother me, mostly because of gravity. Therefore, I choose to believe commercial jets are highly redundant.]

All of this redundancy in IT has the downside of adding complexity and cost to the overall design. Cost is what it is: an organization is willing and able to absorb the expense of redundancy, or it isn’t. For IT practitioners, the larger problem is complexity. The hardware and software required to make a system redundant are a headache of their own, and can ironically introduce fragility into the very system they are meant to make robust. (See David Meyer’s talk on this topic if it piques your interest.)

One of the ways to reduce the complexity in redundant networking schemes is what I call the principle of same-same.

Simply put, “same-same” means that what you do in one place, you match exactly in the other.

The best way for me to explain this is by way of example.

  1. When uplinking a host to two access-layer switches, use the same port number on both switches. If you plug the host’s first NIC into port 12 of ToR switch 1, plug the host’s second NIC into port 12 of ToR switch 2. In addition, port 12 on the ToR switches should have identical configurations.
  2. When configuring redundant switches, configure them identically. Both switches should have the same QoS, management, and routing configurations, consistent port descriptions, matching access-lists, and so on (a quick verification sketch follows this list).
  3. When specifying redundant core switches, they should be identical hardware. If they are fixed-configuration switches, they should be the same model, optioned identically. If they are chassis switches, they should have the same supervisor engines and line cards.
  4. When building a redundant network at a different data center, either for disaster recovery or for “active/active” application designs, the networks as a whole should be identical. Even if it is only used for DR, the second data center’s network must be counted on to behave identically to the first’s.
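
To make point 2 concrete, here’s a minimal sketch of a “same-same” check: a short Python script that diffs the exported running configs of a redundant pair and flags drift. The file names and the list of lines allowed to differ (hostnames, management addresses, HSRP priorities) are hypothetical placeholders rather than any particular vendor’s syntax, so adjust them to match your environment.

```python
#!/usr/bin/env python3
"""Minimal sketch: verify two redundant switches are configured "same-same".

Assumes each switch's running config has already been exported to a text file
(for example, by your normal config-backup tooling). The file names below are
hypothetical placeholders.
"""
import difflib
from pathlib import Path

# Lines that legitimately differ between the pair (hostnames, management
# addresses, HSRP priorities, and so on); filter these out of the diff.
EXPECTED_DIFF_PREFIXES = ("hostname", "ip address", "standby 1 priority")


def normalized_lines(path: str) -> list[str]:
    """Read a config file and drop lines that are expected to differ."""
    return [
        line
        for line in Path(path).read_text().splitlines()
        if not line.strip().startswith(EXPECTED_DIFF_PREFIXES)
    ]


def same_same(config_a: str, config_b: str) -> bool:
    """Print any drift between the two configs; return True if they match."""
    diff = list(
        difflib.unified_diff(
            normalized_lines(config_a),
            normalized_lines(config_b),
            fromfile=config_a,
            tofile=config_b,
            lineterm="",
        )
    )
    for line in diff:
        print(line)
    return not diff


if __name__ == "__main__":
    # Hypothetical file names for the two ToR switches in one rack.
    if same_same("tor1-running-config.txt", "tor2-running-config.txt"):
        print("Configs match: same-same.")
    else:
        print("Configs have drifted: fix it before a failure finds the difference.")
```

If the diff comes back empty, the pair is same-same; anything else is configuration drift waiting to complicate a failure.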

This approach to building out redundant physical networks helps to reduce overall system complexity.

First, troubleshooting becomes easier. For instance, an engineer who can count on a server being plugged into the same port on both ToR switches, on both switches carrying the same access-lists, and so on, will have an easier time diagnosing a problem.

Second, performance is predictable. Mismatched network equipment means that application performance can vary depending on which path traffic takes through the infrastructure. When equipment is identical, there should never be a performance problem due to path, except in the case of a partial failure such as an optic going bad.

Third, and related to the second point, capacity planning is easier. Redundant network designs should never exceed 50% utilization on any particular path, assuming a dual design. The idea is that if one path fails, the redundant path will need to carry the entire load. If the redundant path is a lower-performing one, whether in raw speed or in packet-per-second forwarding capability, its ability to handle the full load is compromised. Mismatched equipment makes capacity planning more difficult.
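
To put rough numbers on that, here’s a small sketch that generalizes the 50% rule: with N equal-capacity paths, each path should stay at or below (N - 1)/N utilization so the survivors can absorb a single path failure. The function names are just illustrative; the arithmetic is the point.

```python
# Rough sketch of the capacity-planning rule above, assuming N equal-capacity
# paths and that a failed path's load spreads evenly across the survivors.

def max_safe_utilization(path_count: int) -> float:
    """Highest steady-state utilization per path that still survives one path failure."""
    if path_count < 2:
        raise ValueError("Redundancy needs at least two paths.")
    return (path_count - 1) / path_count


def survives_single_failure(per_path_utilization: float, path_count: int) -> bool:
    """True if the remaining paths can absorb one failed path's load."""
    # Total offered load, expressed in units of a single path's capacity.
    total_load = per_path_utilization * path_count
    # After one failure, the load per surviving path must not exceed 100%.
    return total_load / (path_count - 1) <= 1.0


if __name__ == "__main__":
    print(max_safe_utilization(2))           # 0.5, the familiar 50% rule for a dual design
    print(survives_single_failure(0.45, 2))  # True: headroom remains after a failure
    print(survives_single_failure(0.60, 2))  # False: the surviving path would be overloaded
```

None of this accounts for mismatched hardware; if one path forwards fewer packets per second than the other, the simple fraction above no longer holds, which is exactly why same-same equipment makes the planning easier.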

In summary, building network infrastructure “same-same” in multiple locations is one way to ensure that applications will perform consistently, no matter what pipes they are flowing through. In addition, an element of randomness is reduced in the system when redundant networks match identically. Reducing randomness reduces complexity – and that’s a good thing.

Redundancy is never the time to think, “I can put this old piece of hardware in place, because it’s just a backup.” Redundancy is NOT simply a backup. Rather, redundancy is the only thing keeping an organization moving forward in the case of an inevitable failure. Don’t think of redundancy as the spare tire “donut” in the trunk, where you can keep going as long as you travel slowly enough. Rather, redundancy is the full-size spare that lets your applications continue running at normal speed.




Comments

  1. Perry

    I would add a couple of things. A redundant path is just that: it is redundant, doing nothing while waiting for a failure.
    You blurred the lines between redundancy and active/active resilience a little with your point about making sure the backup can take 100% of the load, but it is still a valid one. I would also say that there are certain things which do need to differ between same-same devices, such as priorities for routing (think HSRP or BGP MED, for example). You want to make sure that you get a symmetrical path through stateful devices like firewalls, or else everything can break.
    Lastly, I cannot recommend enough the value of regularly testing resilient components to ensure that the resilience is there and functional for when you need it most. An approach that I have seen in the field, though not common, is to schedule a flip to run on alternate sides of a same-same setup over consecutive weeks. Not always practical, but it ticks the compliance box for continuity-of-business testing.

    1. Ethan Banks (Post Author)

      Yep, completely agree on the point about routing priorities. It was one of those points I pondered whether or not to include; I opted not to, in an effort to keep the piece as simple as possible and built around the specific point I was making. I was also hoping it would be an obvious one. Maybe not? In any case, I completely agree, and am glad you brought it up.

      The symmetrical path issue is a good one as well. It’s one of several other points worth making, all swirling around in my Evernote as I get ready for a presentation at Interop on active/active data centers. I am planning to do a few more related posts like this one. I tried to stay more focused on physical issues and less on config in this piece. Maybe a decent follow-up is on managing paths through a redundant infrastructure.

      1. Marc

        I am a massive fan of the same-same logic, and this article is music to my ears. However, the reference to the plane triggered something:
        vendor resilience. On a plane with multiple redundant components, they will always be from different vendors, and they will effectively “vote” out one that begins malfunctioning.
        This is valid in networks too. Do you want both of your core switches vulnerable to exactly the same software bug or hardware malfunction?
        It becomes extremely complex and expensive having, say, a Cisco A side and a Brocade B side, but it does give you another layer of protection. Effectively, the vendor is a SPOF.

        1. Ethan Banks (Post Author)

          Marc, yes, I’ve run into this argument before. I’ve had the discussion in the context of security, as well, the big idea being if you want to better mitigate your security risks, a defense-in-depth strategy implies different vendors scanning traffic along the way. Don’t make your firewall & IDS/IPS vendors the same, etc.

          As far as network resilience goes, I’ve not personally seen dual cores from different vendors, as in Core1 & Core2 are from different vendors. I have heard of, but never actually experienced, A&B network designs, where A network might be all Cisco and B network might be someone else. I have also seen A&B designs where the vendor might be the same, but the routing protocol is different. That’s another way to work around the “everyone’s got the same bug” problem, because you’re into a different code base feeding the forwarding plane.

          I’ll also add that I have been a victim of this “everyone’s got the same bug” problem you raise, but never catastrophically where there was a resulting DC outage. More like annoying bugs. Things like slow memory leaks in a routing process, that kind of thing where, as long as you catch it, you can find a workaround before it becomes a problem. Or alternately, where the bug isn’t likely to hit both devices at the same time. Still, the common code base is a danger inherent in the same-same philosophy. That said, my personal opinion is that the benefits of reduced complexity outweigh the risk of the occasional bug. I’m sure someone has a good war story that might make me re-think that position.

        2. Graeme

          Redundancy can be taken to many levels. This reminds me of an interview with the head of development for the first Airbus fly-by-wire system. The design and implementation of the redundant systems were handled by teams that weren’t allowed contact with each other, so there was no accidental “contamination”.

  2. Leighton Evans

    Great post Ethan as usual. As an extension to this, one of the main reasons most orgs don’t have consistent ‘same same’ architectures and configs is cost. Capex and Opex cycles / limitations always introduce elements into the network which prevent this approach. The overall challenge of designing, building and assuring the network is very rarely technically driven only. However, as you’ve rightly pointed out, if cost is no object, it’s hard to build an argument against ‘same same’ as a core strategic paradigm for any network.

  3. Pingback: The Principle of Same-Same in Physical Network Design

  4. Pingback: Secret Sunday #1 - LameJournal

  5. saeed

    There are several kinds of redundancy sites for an organisation. The kind you introduced is the “hot” redundancy (backup) site, which is very expensive even for a small organisation. There are other kinds as well (known as cold and warm redundancy sites), and each kind is usable depending on the situation and the project.

  6. Anas Tarsha (@anastarsha)

    These are very good points, Ethan. Predictability in network design is everything, especially when you have a redundant network. Having to troubleshoot traffic flows and feature behavior takes away from the whole concept of designing reliable networks.

    If I may add one more thing that should be factored in: the software. Assuming your data center switches are made by the same vendor (which is the case most of the time), do you run the same code on all the switches, or do you run a different version of code on each switch? Each approach has its pros and cons. Running the same code across the board gives you predictability and ease of management, but puts you at risk and forces you to upgrade all switches when hitting a critical bug. On the other hand, running different code on each switch eliminates that risk but makes managing performance and features much more difficult.

    The best approach I have seen is to deploy one consistent code version per network layer where possible, so, for example, you would run the same code on all access-layer (ToR) switches. This is usually a good practice regardless of whether you have a redundant design or not, but I think it becomes even more important when applying the “same-same” concept to network design because, again, it gives you predictability.

    What are your thoughts on this?

  7. Jason Edelman (@jedelman8)

    Nice article, Ethan. I agree with you, and I’m a fan of using “symmetry” as the term to describe this. More often than not, I see DR or secondary data centers that don’t mimic the primary. While I don’t necessarily agree with it, budgets and such often play a hand in this. That said, as software-only solutions from companies such as VMware, PLUMgrid, and Nuage emerge, I do wonder what impact this could have. From a personal perspective, I’d be *more okay* with not having the same hardware/physical design because the logical would be the same, but of course from an operational and performance perspective, same-same or symmetric designs would always be ideal.

    What do you think? :)

    Thanks,
    Jason

  8. Mike D

    I’d offer this RE: point 3: if the redundant link is being utilized in a “standby” mode (not active/active), and if it could be operated at a lower throughput (think 1 GbE instead of 10 GbE), then that becomes another cost question. Is the organization willing to try to lower TCO by accepting lower performance on the “standby” link?

  9. Pingback: Secret Sunday - Mr Ethan C Banks - LameJournal
