Work Projects Keepin’ Me Busy: 10-Gig Rollout + New MDF + RMON

I am buried at work. That is to say, happily buried. I bore easily; it’s a mistake to give me too little to do. The more projects on my plate, and the more challenging they are, the happier I am. These days, I’m just buried with stuff to do – it’s awesome. Not all of them are super-interesting, but here are a few of the more Cisco-oriented projects I’m dealing with.

  1. 10-Gig Rollout (core/dist). We have a growing need for 10-gigabit Ethernet in our cores. Intra-data-center replication traffic among various SAN arrays is saturating some of the Gig-E links that interconnect the core 6500s. Last fall, we were seeing outDiscards climb on the etherchannel connecting the core 6500s because one member of the etherchannel was saturated. I put a stop-gap in place by changing the etherchannel load-balancing method to “src-dst-port”, which distributed the load more evenly across the members, and the outDiscards pretty much went away (there’s a sketch of that change after this list). Even so, we need to get more pipe in between the core switches – that’s just the way it is. The problem will only get worse, and changing the load-balancing method was just a finger plugging the leaking dike.

    To that end, I’m working through a project plan to test each step of the 10-gig upgrade that will interconnect the core 6500s. Of course, that means an IOS upgrade to support the WS-X6708-10G-3C. Which of course means CHOOSING an IOS to upgrade to (we’re going with a SafeHarbor release of some flavor). Upgrading a 6500 with redundant sup engines is a reasonably straightforward process (roughly outlined after this list), but you gotta have your ducks in a row to make sure you minimize risk to production traffic.

  2. 10-Gig Rollout (access). We have new equipment showing up that can uplink at 10-gig, but we don’t have any access-layer switches to support 10-gig. The 4900M looks like about the right fit for us in this area. I’ve got parts lists pulled together, and I’m working on reqs. The 4900M itself isn’t that pricey, but once you load up the 10-gig slots with modules, you can easily double the price of the chassis.
  3. New MDF. We don’t have a fiber distribution network for ethernet in the location I support. Of course, 10-gig is fiber-only at the moment. To roll out multimode fiber throughout the data center, we didn’t want to do a bunch of point-to-point runs – that gets ugly and unmanageable in a hurry. We decided to build a new MDF for it – we only need a few runs today, but these things tend to grow unexpectedly. You might think managing cabling deployments is not the job of a network engineer, but I’d much rather manage the initial deployment to make sure what’s being done makes sense and can be replicated a hundred times if we need to later on. So, yeah – if I’m going to be plugging cables into and out of this MDF for the next 10 years, I want it done as right as we know how to do it, right at the start. That process involves committing data center floor space, updating relevant floor plans, coordinating cabling & electrical contractors, managing equipment lists for new cabinets and racks, etc.
  4. RMON evaluation. Our Network Command Center manager approached my boss about deploying RMON to routers so that the router sends a trap to the SNMP manager in the NCC when an interface exceeds “X” percent utilization. I started digging into this, and I think I have a good solution for it. I did a bit of RMON review and lab work prepping for the CCIE, but I dug into it even deeper today, getting down into OID research and starting to work up some sample code based on a nifty example I found in the Cisco IOS Cookbook (a sketch along those lines follows this list). I’d much rather use code someone else already wrote and tailor it to my needs than reinvent the wheel. I’m not scared to write my own code, but why labor to accomplish something someone else already did? So, I should be able to submit RMON code for testing tomorrow, and then the question becomes how well it scales once it’s working.

    The challenge is this: we’ve got a number of large WAN clouds, each with up to several hundred routers out at the edge. So…do we deploy a single RMON monitor on each edge device? We have an automation tool that could push the code for us – I’m not worried about having to touch all the equipment. But this thought popped into my head – if we want the router to send a trap when the WAN link utilization is high, THERE’S NO GUARANTEE THAT THE TRAP WILL GET THERE. That SNMP trap is a UDP packet that has to cross a WAN link that’s probably saturated, and UDP is an unreliable transport. Yeah, I could QoS the trap on the edge router, but that’s just yucky. So, practically speaking, we can’t push this RMON code out to the edge of the WAN cloud. That means putting the RMON monitor on the head-end aggregation routers. Okay, fine – so let’s say I’ve got 100+ serial sub-interfaces coming into a given router. How much of a CPU/memory load does RMON add? I’m sure 1 RMON monitor is no big deal. What about 10 monitors? What about 50? 100? More? I won’t know without testing how much of an impact RMON will have.
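
First, the load-balancing stop-gap from item 1. This is a sketch rather than our actual config – the port-channel number is a placeholder, and the exact load-balancing keywords available vary by platform and IOS version:

    ! Global command on the 6500: hash on source/destination L4 port so flows
    ! spread across all etherchannel members instead of piling onto one.
    port-channel load-balance src-dst-port
    !
    ! Verify the method took, then keep an eye on the output drops afterward
    show etherchannel load-balance
    show interfaces port-channel 1 | include drops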
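
Next, the rough shape of the redundant-sup IOS upgrade from item 1, heavily abbreviated. The image name and TFTP server are placeholders, the flash filesystem names (disk0:/slavedisk0:) depend on the sup and flash type, and whether the switchover is SSO or RPR depends on the release – so treat this as a sketch and follow the release notes for whatever image we end up choosing:

    ! Stage the new image on the active and standby supervisors (placeholder names)
    copy tftp://10.1.1.1/s72033-new-image.bin disk0:
    copy tftp://10.1.1.1/s72033-new-image.bin slavedisk0:
    !
    ! Point the boot variable at the new image and save
    configure terminal
     boot system disk0:s72033-new-image.bin
     end
    write memory
    !
    ! Confirm the standby is ready, then fail over to it during the change window
    show redundancy states
    redundancy force-switchover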
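
Finally, the RMON piece from item 4, in the spirit of the Cookbook example rather than the code I’ll actually submit. The manager address, community, owner string, and ifIndex are placeholders, and the thresholds are sized for a T1 just to show the math: 80% of 1.544 Mbps sustained over a 300-second sample works out to roughly 46,320,000 octets (0.8 * 1,544,000 bps * 300 s / 8 bits per octet).

    ! Where the traps go (placeholder manager address and community)
    snmp-server host 192.0.2.10 public
    !
    ! Event 1 fires when the rising threshold is crossed, event 2 on the way back down
    rmon event 1 log trap public description "WAN utilization high" owner netops
    rmon event 2 log trap public description "WAN utilization normal" owner netops
    !
    ! Watch ifInOctets for ifIndex 3 (ifEntry.10.3) as a 300-second delta.
    ! Rising threshold is ~80% of a T1 over 300 seconds; falling is ~60%.
    rmon alarm 10 ifEntry.10.3 300 delta rising-threshold 46320000 1 falling-threshold 34740000 2 owner netops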

Miscellaneous problems I worked on this week:

  • Router would mysteriously pause for a minute when a user was authenticating via SSH. It seemed to be related to the SSH version: dumbed down to SSH version 1, life was good; SSH version 2 was the problem. I suspect an issue with the IOS on these particular routers – they aren’t running the standard IOS we typically deploy, so that’s the likely culprit. (The workaround is sketched after this list.)
  • Customer getting poor performance on a PVC. Another engineer bridged me onto the call to review. I noticed that the IGP neighbor adjacency was dropping across one PVC (where the performance was poor), but not the other. I recommended that we down the problem serial subinterface and try the customer’s transfer again. Problem resolved, and the bad PVC was called in to the provider (also sketched after this list). Seems like a simple problem to have resolved, right? Well, I got called in because the router employed a QoS policy I wrote and taught to the group. Since widely deployed QoS is new to our environment, it tends to get blamed when weird things happen on a circuit. I had to set everyone’s mind at ease that the QoS policy wasn’t the problem, and then diagnose what was REALLY going on. Welcome to my world…
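
For the SSH pause, the workaround amounted to pinning those routers to SSH version 1 until we can get them onto our standard IOS – not where we want to live long-term, but it keeps logins usable. A sketch:

    ! Check what the router is currently negotiating (version, retries, timeouts)
    show ip ssh
    !
    ! Workaround: force SSHv1 until these routers get our standard IOS image
    configure terminal
     ip ssh version 1
     end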
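
And for the PVC issue, the fix really was as simple as it sounds. Interface numbers are placeholders, I’m assuming a frame-relay PVC, and I’m showing OSPF only for the sake of the sketch – the point is that the adjacency was bouncing on one subinterface and rock-solid on the other:

    ! Compare PVC status and IGP neighbor state across the two subinterfaces
    show frame-relay pvc
    show ip ospf neighbor
    !
    ! Take the flapping subinterface out of the path and retest the customer's transfer
    configure terminal
     interface Serial0/1.102
      shutdown
     end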