Ethan Banks Not writing about IT.

Should Monitoring Systems Also Perform Mitigation?


In a recent presentation, I was introduced to Netscout’s TruView monitoring system, which includes the Pulse hardware appliance. The Pulse is a little guy that you plug into your network, where it phones home to the cloud. When it comes online, you’ll see it in your TruView console, and can configure it to do what you like. The purpose of a Pulse is to run network monitoring tests, such as transactional HTTP or VoIP, from specific remote locations, and report the test results centrally. In this way, you can tell when certain outlying sites under your care and feeding are underperforming.

As far as remote network performance monitoring systems go, TruView is similar to NetBeez, ThousandEyes, and doubtless some others. Each of these solutions has its pros and cons. They are useful. They are necessary. They do their jobs well, not unexpectedly for a market that’s got a lot of years behind it. We need monitoring, yes, even in the age of SDN. But I believe monitoring could eventually evolve and couple itself with SDN to become something more powerful.

Historically, monitoring solutions have been very good at alerting you when something has gone awry. Shiny red lights and sundry messages can tell us when a transaction time is too high, an interface is dropping too many packets, database commits are taking too long, or a WAN link’s jitter just went south. That information is wonderful, but doesn’t resolve the issue. A course of action is required.

Perhaps the future of monitoring is not in the gathering of information, but in the actions taken on the information. This is where the more interesting bits of software-defined infrastructure come into play. Software-defined infrastructure is admittedly immature, lacking in standards, and fraught with vendor contention. But I believe that for monitoring solutions to have long-term viability, they will need to have mitigation engines that can react to certain infrastructure problems. I’m presuming (perhaps laughably so) that we’ll have some modicum of software-defined interfaces eventually. But let’s say that happens, such that it becomes possible for developers to write monitoring solutions with mitigation engines for software-defined infrastructure. Isn’t that a logical progression?

To make my point, some solutions already do this sort of thing. Consider SD-WAN. If WAN links were my only consideration, would I need a separate transactional monitoring system to tell me that a given link fell far below the quality required for a voice call? Not in an SD-WAN world. I would have configured a policy such that my voice call would have been routed across a link capable of meeting a voice SLA. SD-WAN both monitors (maybe not transactionally, but it is monitoring all the same) and takes action if required. The transactional monitoring supplied by a standalone system is, well, not uninteresting, but less interesting at that point.
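To make the policy idea concrete, here is a minimal sketch of SLA-based link selection. It is not how any particular SD-WAN product works. The `LinkStats` and `Sla` types, the `pick_link` function, the link names, and the threshold values are all hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkStats:
    """Most recent measurements for one WAN link (hypothetical shape)."""
    name: str
    latency_ms: float
    jitter_ms: float
    loss_pct: float


@dataclass(frozen=True)
class Sla:
    """Upper bounds a link must stay under to carry a class of traffic."""
    max_latency_ms: float
    max_jitter_ms: float
    max_loss_pct: float

    def is_met_by(self, link: LinkStats) -> bool:
        return (
            link.latency_ms <= self.max_latency_ms
            and link.jitter_ms <= self.max_jitter_ms
            and link.loss_pct <= self.max_loss_pct
        )


# Illustrative thresholds only; substitute your own voice requirements.
VOICE_SLA = Sla(max_latency_ms=150.0, max_jitter_ms=30.0, max_loss_pct=1.0)


def pick_link(links: list[LinkStats], sla: Sla) -> Optional[LinkStats]:
    """Return the lowest-latency link that meets the SLA, or None if none does."""
    eligible = [link for link in links if sla.is_met_by(link)]
    return min(eligible, key=lambda link: link.latency_ms, default=None)


# Made-up measurements: "mpls" has the best latency but too much jitter for voice.
links = [
    LinkStats("mpls", latency_ms=40.0, jitter_ms=45.0, loss_pct=0.1),
    LinkStats("broadband", latency_ms=60.0, jitter_ms=8.0, loss_pct=0.2),
    LinkStats("lte", latency_ms=95.0, jitter_ms=20.0, loss_pct=0.5),
]
chosen = pick_link(links, VOICE_SLA)
print(chosen.name if chosen else "no link meets the voice SLA")
```

The monitoring and the decision live in the same place here: the measurements feed straight into the forwarding choice, with no human reading a dashboard in between.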

Admittedly, this turns monitoring tools into something else entirely. They go from being polling engines and stats collectors to policy and configuration engines with a great deal of logic and complexity required, especially if they are to work on disparate network topologies. But as the network becomes more programmatically accessible, monitoring seems like mere table stakes, the bare minimum required to have a viable tool. Reconfiguring the network to maintain a predefined SLA seems like a logical long-term goal. Don’t just tell me about the problem. Fix it.
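As a rough sketch of what "fix it" might look like in code, the toy monitor below tries a registered mitigation when a measurement breaches its threshold and only raises an alert if no mitigation exists or the mitigation reports failure. Every name here (`MitigatingMonitor`, `register`, `evaluate`, the site names) is hypothetical, and the mitigation is a stand-in lambda rather than a real network change.

```python
from typing import Callable, Dict, List

# A mitigation receives the affected target's name and returns True if it
# believes it fixed the problem. This interface is an assumption.
Mitigation = Callable[[str], bool]


class MitigatingMonitor:
    def __init__(self, threshold_ms: float, alert: Callable[[str], None]) -> None:
        self.threshold_ms = threshold_ms
        self.alert = alert
        self.mitigations: Dict[str, Mitigation] = {}

    def register(self, target: str, mitigation: Mitigation) -> None:
        """Attach a mitigation to a monitored target."""
        self.mitigations[target] = mitigation

    def evaluate(self, target: str, measured_ms: float) -> str:
        """Return "ok", "mitigated", or "alerted" for one measurement."""
        if measured_ms <= self.threshold_ms:
            return "ok"
        mitigation = self.mitigations.get(target)
        if mitigation is not None and mitigation(target):
            return "mitigated"
        # Nothing could fix it automatically, so fall back to telling a human.
        self.alert(f"{target}: {measured_ms:.0f} ms exceeds {self.threshold_ms:.0f} ms")
        return "alerted"


alerts: List[str] = []
monitor = MitigatingMonitor(threshold_ms=200.0, alert=alerts.append)
monitor.register("branch-12", lambda target: True)  # stand-in for a real change

print(monitor.evaluate("branch-12", 120.0))  # under threshold
print(monitor.evaluate("branch-12", 450.0))  # breach, mitigation registered
print(monitor.evaluate("branch-40", 450.0))  # breach, nothing registered
print(alerts)
```

The hard part is everything hidden behind that lambda: the logic to decide which change is safe on a given topology is where the real complexity lives.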

As an aside, SD-WAN is perhaps an unfair role model to set out here. SD-WAN has a limited problem scope. An SD-WAN forwarder only has to monitor the virtual links between it and other forwarders in the network, and make a forwarding decision across those links. In addition, the action set is limited. Forwarding across a large network with complex attributes made up of any number of vendors’ gear is a somewhat different scope than the one the SD-WAN vendors are solving. Still. There’s a startup idea here for someone.

3 comments

  • Hi Ethan,

    Good post. Having done an extensive stint administering various monitoring applications for an ISP, I can say there is a lot of room for improvement in the network area.

    Much like with automation we are again playing catch-up to the system and application engineers who have had this on their radar much longer (think event handlers in Nagios, watchdogs or even Puppet checking for libraries and applications and installing/starting when not present/running).

    I believe visibility is the key to making decisions at any level, and in the end we need to think about the whole picture and ask questions like “should I restart this service on a system, or should the application handle the failure?” or “should I inject alternative routes when confronted with bandwidth issues, or should capacity management play a role here?”

    In any case monitoring is an interesting area to follow and see where things go.

    regards,

    Alan

  • It is very difficult for software development companies to quickly solve all the requirements for every business model. Some innovative company with enough focused resources may be able to tackle all requirements one day, but until that happens, open source initiatives and software companies/vendors need to focus on open access and integration via standards and APIs. Enterprises need to focus their resources on integrating those tools together with their processes to automate their environment to manage service lifecycles: designs, implementations, and operations. This automation includes the monitoring and mitigation, which may be handled by more than one tool, but let’s face it, what good is monitoring if IT departments aren’t rapidly reacting to the data?

    In summary, I agree with your assessment: we should be doing something with the monitoring data to resolve problems quickly. Alerting should let us know when issues arise that cannot be resolved automatically so we can then resolve, iterate, and improve mitigation attempts the next time an error occurs. Reporting can tell us how many times our mitigation via automation has resolved incidents vs. manual intervention to help tell a story.

    Regards,
    Jason

By Ethan Banks

You probably know Ethan Banks because he writes & podcasts about IT. This site is his, but covers other stuff.
