Scalability Is A Matter Of Context


The question, “How big can I make it?” pervades all of IT. This is the scale question, and is at the root of such queries as, “How many shelves can I add to this storage array?” “How many forwarding table entries will this switch hold?” and “How many cores am I licensed for before I need to upgrade?”

In IT, we’re all worried about how a technology we invest in today will grow with us tomorrow. No one wants a shortsighted purchase to result in a forklift upgrade during the next budget cycle. We want to get 3, 5, 7, or even 10 years out of IT purchases.

This is a valid perspective born of experience. Disrupting IT operations to install something new is hard, so we like to limit those sorts of projects to every few years to minimize staff baldness and executive ire. And so, we buy as big as we can with every expectation of expanding our technology investment with a minimum of complexity. Who can argue with that logic? (I could actually, but that’s a topic for another day.)

Here’s another viewpoint. Not everything has to scale infinitely to be useful. Scalability does matter, but it matters only in certain contexts and frames of reference. Therefore, not every new technology that comes down the line has to scale globally to have a use-case.

Let’s consider a few examples.

SD-WAN Forwarding & Scale

I wrote recently about SD-WAN’s application-oriented best-path. I made a basic comparison between traditional routing protocols and SD-WAN forwarding to point out that SD-WAN considers an application’s specific needs before making a forwarding decision. Traditional routing protocols don’t. More to the point, traditional routing protocols can’t. How would that scale? How could application forwarding requirements and state be maintained across a global infrastructure, especially if the application characteristics and policy are not distributed from a central point? I have heard of some efforts in this space, but it’s not an easy problem to solve, and that’s before considering the complex metrics required. (See the links at the bottom of this article for much more information on this.) As an industry, we’ve done adequately with QoS and traffic engineering to hammer out the most egregious application challenges. It’s a coarse approach that’s difficult to manage. But…it scales.
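To make the contrast concrete, here is a minimal sketch in Python. It is not how any particular SD-WAN product or routing protocol is implemented; the `Path` and `AppPolicy` types, the metric names, and the sample numbers are all assumptions invented for illustration. The point is only that a single-metric choice gives every application the same answer, while an application-aware choice filters paths by what the application needs first.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Path:
    name: str
    cost: int          # static routing-style metric (lower is preferred)
    latency_ms: float  # hypothetical measured latency
    loss_pct: float    # hypothetical measured packet loss


@dataclass(frozen=True)
class AppPolicy:
    name: str
    max_latency_ms: float
    max_loss_pct: float


def lowest_cost_path(paths: List[Path]) -> Path:
    """Single-metric choice: the same answer for every application."""
    return min(paths, key=lambda p: p.cost)


def app_aware_path(paths: List[Path], policy: AppPolicy) -> Optional[Path]:
    """Keep only paths that meet the application's needs, then pick the cheapest."""
    eligible = [
        p for p in paths
        if p.latency_ms <= policy.max_latency_ms
        and p.loss_pct <= policy.max_loss_pct
    ]
    return min(eligible, key=lambda p: p.cost) if eligible else None


# Illustrative values only.
paths = [
    Path("mpls", cost=10, latency_ms=40.0, loss_pct=0.1),
    Path("broadband", cost=5, latency_ms=25.0, loss_pct=1.5),
]
voice = AppPolicy("voice", max_latency_ms=150.0, max_loss_pct=0.5)
backup = AppPolicy("backup", max_latency_ms=500.0, max_loss_pct=5.0)

print(lowest_cost_path(paths).name)        # broadband
print(app_aware_path(paths, voice).name)   # mpls
print(app_aware_path(paths, backup).name)  # broadband
```

Note what the second function needs that the first does not: per-path measurements and a per-application policy, both kept current. That extra state is exactly what becomes hard to maintain at global scale.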

Indeed, application-oriented routing presents a functional scaling problem. Thus, in the context of replacing traditional routing with SD-WAN forwarding globally, it’s unfair to compare the two. That said, SD-WAN doesn’t purport to function in a global context. SD-WAN targets a specific use case, typically that of an enterprise WAN carrying a limited number of applications that need particular treatment. That’s not infinite scale. That’s enterprise WAN scale. That’s likely to be a few thousand sites or fewer, and in most cases far fewer. Not tens of thousands of sites or more. So, does SD-WAN’s application-oriented best path computation need to scale the way OSPF or BGP does to be viable? It does not. It just needs to scale well enough to replace traditional routing in a specific WAN cloud.
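A rough back-of-the-envelope calculation shows why the context matters. The model below is an assumption of mine, not a description of any real product: it supposes a full mesh in which every ordered pair of sites is tracked over each transport for each application class. Real deployments (hub-and-spoke, regional meshes) would need far less state, so treat the numbers as an upper bound for this toy model only.

```python
def per_app_path_state(sites: int, transports: int, app_classes: int) -> int:
    """Entries to track in a hypothetical full mesh:
    ordered site pairs x transports x application classes."""
    ordered_pairs = sites * (sites - 1)
    return ordered_pairs * transports * app_classes


# Illustrative sizes: a small WAN, a large enterprise WAN, and something far bigger.
for sites in (100, 2_000, 50_000):
    entries = per_app_path_state(sites, transports=2, app_classes=10)
    print(f"{sites:>6} sites -> {entries:,} entries")
```

Because the count grows with the square of the number of sites, what is manageable at enterprise WAN size quickly stops being so well beyond it.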

SDN Controllers & Scale

Another victim of the “lack of scalability” critique is the humble SDN controller. I think of SDN controllers as arbiters between physical or virtual network devices southbound and applications northbound. The controller runs or provides an interface to applications that express network intent. To fill this role adequately, an SDN controller doesn’t need to scale infinitely. Centralized SDN controllers don’t have to be “one controller for the Internet” to be useful. They don’t even have to be one controller for a single network.

The centralized SDN controller model is not quite “going back to centralization like we did in the olden days and found out it was awful.” It’s applying a new control paradigm to limited domains for the explicit purposes of forwarding flexibility and improved time to market that distributed routing protocols don’t accommodate in certain situations. Controller failures don’t have to result in massive domain outages. Controller failures don’t have to represent a network-wide single point of failure. Proper design will mitigate that risk, at least to a point. It’s reasonable to point out that bad code or a bad human can propagate a problem through a lot of network infrastructure in a hurry using a central point of control. Even so, I’m unsure that the risk of this in an SDN-controlled network is significantly greater than with the “one device at a time” management approach.

Looking at the problem another way, an application that configures the network through an SDN controller is as prone to bad code, bad design, or even human error as networking is today. It is possible for a single element in a distributed system to clobber the entirety of the distributed system, if that system was designed poorly. Even in sound designs, there are things I can think to type at a single device CLI that will blackhole forwarding in the entire forwarding domain, no SDN controller required.

Failure domains must be contained, whether managed by an SDN controller, a human, or a distributed forwarding protocol stack. And thus, we see SDN controllers themselves rolled out hierarchically, perhaps federated with other SDN domains, but having domains of their own. We see SDN controllers that are catalysts for specific network functions, as opposed to replacing all network functions. We see networks leveraging a mix of traditional networking with an SDN layer on top for exception processing. SDN is turning out to do special things — but not all the things.

All of this said, SDN controllers are meant to scale up to handle a large number of devices or operations per second. Please don’t take my diatribe as “SDN controllers can’t or don’t scale.” Of course, they can and do, up to a point that seems more than reasonable for a single domain. See the links below for more data here.

OpenDaylight Performance Testing Wiki

Comparing SDN Controllers: Open Daylight and ONOS

Parting Point

Scale is a relative term. While every technology needs to scale to some point to be useful to IT practitioners, not every technology needs to scale infinitely. Every technology has a context in which it is viable — where it proves to be the best choice. But in another context, the opposite technology might rise to the surface as more appropriate. Don’t be religious about such a decision. Know your business need well, research the technology thoroughly, plan for the future, and choose wisely. Don’t pick a tool that solves someone else’s problem.

Recommended Additional Reading

These posts by Ivan Pepelnjak and Russ White take on the practical challenges faced by distributed routing protocols and describe why they accomplish what they do in the way that they do. They were also direct or indirect responses to my “SD-WAN best path” post, addressing in detail issues that I did not raise at all in the piece, and they make for a well-rounded discussion.

Routing Protocols and SD-WAN: Apples and Furbies (Ivan Pepelnjak)

Reaction: SD-WAN and Multiple Metrics (Russ White)