1,218 Words. Plan about 5 minute(s) to read this.
My Twitter blew up after I tweeted thusly.
Why did I tweet this?
Over my years of network engineering, I’ve learned that the fewer features you can implement while still achieving a business goal, the better. Why? Fewer features mean fewer things that can potentially go wrong. The less that goes wrong, the higher the network uptime. That’s a generalization we could pick at, but I believe it holds true overall.
I have an example of a lesson learned.
Going back almost 10 years, I was an architect at a site evaluating Cisco’s VSS. This was early days for VSS, and the code was not fully baked. Still, on a whiteboard it sounded like a good idea, and offered us a tidy way to uplink our dual-homed access switches to a VSS core without having to block uplinks. That was compelling, as we were still in the 1GbE uplink days, and link saturation was happening from time to time.
What is VSS and any of the similar systems that create a virtual chassis or MLAG capability? It is a complex, distributed control plane, where two or more physically distinct systems maintain constant communication so that they can behave as a single physical system.
I remember reading a detailed Cisco technical whitepaper on VSS that explained just how the magic happened. It sounded…okay. There were dedicated links between the switches. There was some special tagging going on. There were processes to assign which switch would be the master. There were discussions about how to prevent and recover from split-brain. There were notes on what would happen during a single supervisor engine failure. And so on. It all seemed well-thought out and carefully considered. And no doubt it was.
Yet in our lab testing, VSS was a disaster. At that point (very early days, please remember), the code was so underdone as to hard crash IOS and create a cascading failure between the two VSS members. One switch would crash hard, and reboot. On reboot, it would cause the other switch to crash, and so on. I remember walking into to the lab racks after we’d stood up the initial VSS pair, and looking at console output. They had been cascade crashing back and forth for hours, and none of us on the team had been doing any work. The VSS pair blew up all on its own, without even a test traffic load going through them.
This problem was so bad, that we couldn’t get our testing done. We thanked Cisco kindly, and shipped the test sups back.
Complexity kills, but sometimes we need it.
Now, years have gone by since that VSS story. I know of many folks with successful VSS implementations. 1.0 code is always risky. I’m not suggesting that you shouldn’t use VSS. If it’s working for you, have at it.
However, I am suggesting that the complexity introduced by VSS created fragility in the switching system. From that, I extrapolate that any sort of complexity can introduce fragility. This is not at all a new idea. I’ve heard it discussed most eloquently by David Meyer, who’s studied the issue deeply.
For the network engineer, there is irony here. Networkers have hard problems to solve for businesses. Sometimes, those hard problems call for complex solutions. If we assume that a consistently performant, always available network tends to match well with business goals, then complexity is an enemy. Simplicity is better. Less to break. A smaller dependency tree. Easier troubleshooting when something does go wrong.
Yet often (and here’s the irony), complexity is a friend simply due to the nature of the problem we’re trying to solve. Thus, we find ourselves experimenting with immature VSS code as a potential solution to engineer away uplink bottlenecks.
Or adding VXLAN overlays with a centralized controller to meet a segmentation requirement.
Or distributing policy using BGP in the form of flowspec to improve DDoS mitigation capabilities.
Or stretching L2 between data centers, and then having to implement a DCI protocol plus maybe LISP to help with traffic trombones, so that we can put any IP anywhere without disturbing the application.
Or introducing complex device profiling and authentication techniques replete with guest networks, quarantine networks, and encapsulated traffic, because BYOD. And because security.
Or creating WAN edge configurations with DMVPN, PfR, and QoS, because we have to get the most from those slow, expensive private WAN links we can’t afford to upgrade.
Or…need I go on? You can tell your own story of the lurking network horror found in a carefully crafted configuration stanza created by your own Frankensteinian hand.
Simplicity must make a comeback.
As networkers, the complexity we’ve engineered into our networks is indeed partly due to business requirements. Complexity genuinely has a place in our world.
At the same time, we receive the word from on high that we must accomplish X. And within the familiar comfort of our silos, we think about just how to accomplish X. We read. We research. We talk to our vendors. We do some lab work. And finally, we recommend a solution to X that we think will work.
Also from our silos, we probably make that recommendation. We do it without talking to the app people, the dev team, the storage admins, or even our managers if we can avoid it. We take the problem as stated to be gospel. Using the hammers we are comfortable using, we solve problems.
And now we’re screwed. All of us.
We’ve built these complex monster networks with so many dependencies, change is hard. Cloud is coming, or maybe even here, and we don’t know how to map the network we’ve got to what we’ll need next. We’ve entrenched so many special features so deeply inside of our network infrastructures, that we’ve locked ourselves in. Forget the vendors locking us in.
We did this to ourselves.
And we need to undo it. We need to go back to simple. The simpler, the better. We need to resist the temptation to hit problems with the hammers we’re used to. We need to leave our silos and start solving business problems as members of integrated IT teams. Our data center infrastructures work as cohesive, integrated application delivery systems. Why don’t we, as IT folks, follow that same integration model?
We need to be willing to professionally push back from the brink of complexity, and rally around the simpler solution. That might introduce conflict, yes. But done right, conflict will result in a better overall solution. Raising a cautionary note doesn’t make you a jerk, assuming you don’t act like one.
We need to get away from the network engineer performing complex voodoo to salvage a poorly conceived application architecture. We must be willing to demonstrate why the complex network solution is actually riskier than re-thinking the app. We won’t always prevail in those discussions, but we’ll never prevail if we don’t have the discussions at all.
As data center network designs have gone through the complexity roof, saner minds are pushing back, offering simpler solutions while still meeting business goals. Romana.io is an example. Cumulus Networks’ recently announced “L3 to the host” I feel is another.
I believe this simplicity movement should continue. We all need to get to the point that we eschew nerd knobs in favor of the simplest possible solution every time. That doesn’t mean we don’t end up with a complex solution at times. Certainly, we will. But let’s do the hard work required up front to avoid complexity if we can.
— John W Kerns (@packetsar) July 5, 2016
about | subscribe | @ecbanks