Work Projects Keepin’ Me Busy: 10-Gig Rollout + New MDF + RMON


I am buried at work. That is to say, happily buried. I bore easily; it’s a mistake to give me too little to do. The more projects on my plate, and the more challenging they are, the happier I am. These days, I’m just buried with stuff to do – it’s awesome. Not all of it is super-interesting, but here are a few of the more Cisco-oriented projects I’m dealing with.

  1. 10-Gig Rollout (core/dist). We have a growing need for 10-gigabit Ethernet in our cores. Intra-data-center replication traffic among various SAN arrays is saturating some of the Gig-E links that interconnect the core 6500s. Last fall, we were seeing outDiscards climb on the etherchannel connecting the core 6500s, because one member of the etherchannel was saturated. I put a stop-gap in place by changing the etherchannel load-balancing method to “src-dst-port”, which distributed the load more evenly across the members, and the outDiscards pretty much went away (there’s a sketch of that change after this list). Even so, we need to get more pipe in between the core switches – that’s just the way it is. The problem will only get worse, and changing the load-balancing method was just a finger plugging the leaking dike.

    To that end, I’m working through a project plan to test each step of the upgrade to 10-gig interconnects between the core 6500s. Of course, that means an IOS upgrade to support the WS-X6708-10G-3C. Which of course means CHOOSING an IOS to upgrade to (we’re going with a SafeHarbor release of some flavor). It’s a reasonably straightforward process to upgrade a 6500 with redundant sup engines (I’ve roughed out the sequence after this list), but you gotta have your ducks in a row to make sure you minimize risk to production traffic.

  2. 10-Gig Rollout (access). We have new equipment showing up that can uplink at 10-gig, but we don’t have any access-layer switches to support 10-gig. The 4900M looks like about the right fit for us in this area. I’ve got parts lists pulled together, and I’m working on reqs. The 4900M itself isn’t that pricey, but once you load up the 10-gig slots with modules, you can easily double the price of the chassis.
  3. New MDF. We don’t have a fiber distribution network for ethernet in the location I support. Of course, 10-gig is fiber-only at the moment. To roll out multimode fiber throughout the data center, we didn’t want to do a bunch of point-to-point runs – that gets ugly and unmanageable in a hurry. We decided to build a new MDF for it – we only need a few runs today, but these things tend to grow unexpectedly. You might think managing cabling deployments is not the job of a network engineer, but I’d much rather manage the initial deployment to make sure what’s being done makes sense and can be replicated a hundred times if we need to later on. So, yeah – if I’m going to be plugging cables into and out of this MDF for the next 10 years, I want it done as right as we know how to do it, right at the start. That process involves committing data center floor space, updating relevant floor plans, coordinating cabling & electrical contractors, managing equipment lists for new cabinets and racks, etc.
  4. RMON evaluation. Our Network Command Center manager approached my boss about deploying RMON to routers such that the router would send a trap to the SNMP manager in the NCC when an interface exceeded “X” percent utilization. I started digging into this, and I think I have a good solution for it. I did a bit of RMON review and lab work prepping for CCIE, but I dug into it even deeper today, getting down into OID research and starting to work up some sample code based on a nifty example I found in the Cisco IOS Cookbook (a sketch of where that’s headed follows this list). I’d much rather use code someone else already wrote and tailor it to my needs than reinvent the wheel. I’m not scared to write my own code, but why labor over something someone else has already written? So, I should be able to submit RMON code for testing tomorrow, and then the question becomes how well it scales once it’s working.

    The challenge is this: we’ve got a number of large WAN clouds with up to several hundred routers each out at the edge. So…do we deploy a single RMON monitor out on each edge device? We have an automation tool that could push the code for us – I’m not worried about having to touch all the equipment. But I had this thought pop into my head – if we want the router to send a trap when the WAN link utilization is high, THERE’S NO GUARANTEE THAT THE TRAP WILL GET THERE. That SNMP UDP trap packet has to cross a WAN link that’s probably saturated, and UDP is an unreliable transport. Yeah, I could QoS the trap on the edge router, but that’s just yucky. So, we can’t, practically speaking, push this RMON code out to the edge of the WAN cloud. That means putting the RMON monitor on the head-end aggregation routers. Okay, fine – so let’s say I’ve got 100+ serial sub-interfaces coming into a given router. How much of a CPU/memory load does RMON take? I’m sure 1 RMON monitor is no big deal. What about 10 monitors? What about 50? 100? More? I won’t know without testing to see how much of an impact RMON has (the checks I plan to lean on are sketched after this list as well).
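
Here’s roughly what that load-balancing stop-gap from item 1 looks like on a core 6500. Treat it as a minimal sketch – the command is global, so it changes hashing for every etherchannel on the box, and the port-channel number is just a placeholder:

    ! Check how traffic is currently being hashed across the members
    show etherchannel load-balance
    show etherchannel 1 port-channel

    ! Include layer 4 ports in the hash so flows spread more evenly
    ! (global command -- it affects every etherchannel on the switch)
    configure terminal
     port-channel load-balance src-dst-port
    end

    ! Afterward, watch the member links to confirm outDiscards stop climbing
    show interfaces counters errors
    show interfaces port-channel 1 counters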
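
The IOS upgrade itself (also item 1) will follow the usual redundant-sup sequence, roughed out below. Filesystem names, the image name, and the slot number are placeholders that vary by sup model and chassis, so take this as an outline rather than a procedure:

    ! Stage the new image on both supervisors (filesystem names vary by sup model)
    copy tftp://<server>/<new-image>.bin sup-bootflash:
    copy sup-bootflash:<new-image>.bin slavesup-bootflash:

    ! Point the boot variable at the new image and save the config
    configure terminal
     no boot system
     boot system flash sup-bootflash:<new-image>.bin
    end
    copy running-config startup-config

    ! Reload the standby sup so it boots the new code, then fail over to it
    ! (standby sup assumed to be in slot 6 here -- check "show module" first)
    hw-module module 6 reset
    show redundancy states
    redundancy force-switchover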
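
The RMON sample code from item 4 is shaping up along the lines below. It’s a sketch in the spirit of the Cisco IOS Cookbook example, not the final code: the ifIndex (.3), thresholds, community string, and manager address are all placeholders, and the real thresholds will be computed per circuit speed.

    ! Event 1 fires a trap (and a log entry) when utilization crosses the rising
    ! threshold; event 2 fires when it drops back below the falling threshold
    rmon event 1 log trap public description "WAN util high" owner ethan
    rmon event 2 log trap public description "WAN util normal" owner ethan

    ! Alarm 1 samples ifOutOctets (ifEntry.16) for ifIndex 3 every 60 seconds in
    ! delta mode (change per interval). Placeholder thresholds -- derive real ones
    ! from (link bps * target utilization * interval) / 8.
    rmon alarm 1 ifEntry.16.3 60 delta rising-threshold 8500000 1 falling-threshold 6800000 2 owner ethan

    ! Send the resulting traps to the NCC's SNMP manager (placeholder address)
    snmp-server community public RO
    snmp-server enable traps
    snmp-server host 192.0.2.10 version 2c public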
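
As for the scaling question, the plan is to add alarms in batches on a lab aggregation router and watch the box between batches with something like the commands below before any of this goes near a production head-end:

    ! Confirm the alarms and events actually installed
    show rmon alarms
    show rmon events

    ! Watch CPU as the alarm count grows
    show processes cpu sorted | exclude 0.00
    show processes cpu history

    ! Keep an eye on free memory, since each alarm and event consumes a little
    show processes memory
    show memory statistics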

Miscellaneous problems I worked on this week:

  • Router would mysteriously pause for a minute when a user was authenticating via SSH. It seemed to be related to the SSH version: when we dumbed things down to SSH version 1, life was good, while SSH version 2 was a problem (the workaround is sketched below). I suspect a problem with the IOS on these particular routers – they aren’t running the standard IOS we typically deploy, so that’s the likely culprit.
  • Customer getting poor performance on a PVC. Another engineer bridged me onto the call to review. I noticed that the IGP neighbor adjacency was dropping across one PVC (the one where performance was poor), but not the other. I recommended that we shut down the problem serial subinterface and try the customer’s transfer again. Problem resolved, bad PVC called in to the provider. Seems like a simple problem to have resolved, right? Well, I got called in because the router employed a QoS policy I wrote and taught to the group. Since widely deployed QoS is new to our environment, it tends to get blamed when weird things happen on a circuit. I had to set everyone’s mind at ease that the QoS policy wasn’t the problem, and then diagnose what was REALLY going on (those checks are sketched below, too). Welcome to my world…
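
For the SSH pause, the workaround boiled down to checking which versions the router was offering and pinning it to version 1 – roughly the following, to be reverted once these routers are on our standard IOS:

    ! See what the router is offering (version 1.99 means both v1 and v2)
    show ip ssh

    ! Pin the router to SSH version 1 as the workaround on the suspect IOS
    configure terminal
     ip ssh version 1
    end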
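
And for the PVC case, taking the QoS policy off the table and isolating the bad PVC came down to a few show commands and a shutdown. Interface names are placeholders, OSPF stands in for whichever IGP is running, and I’m assuming a frame relay PVC here:

    ! If the policy were the problem, drops would be piling up in a class here
    show policy-map interface Serial0/0/0.101

    ! The real tell: the IGP adjacency flapping on one PVC but not the other
    show ip ospf neighbor
    show frame-relay pvc interface Serial0/0/0.101

    ! Shut the suspect subinterface so traffic rides the healthy PVC, then retest
    configure terminal
     interface Serial0/0/0.101
      shutdown
    end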

10 comments

  • Must be a pretty nice world to live in ;) Sounds like you are having a lot of fun. You clearly love this stuff. And it shows in how organized and detailed you are in the way you describe things (complete with links and all).
    Someday I hope to experience the kind of “good busy” that a high level of skill demands of you. Right now, I’m taking a break from the “yucky busy” of scanning barcode labels for our warehouse items all day. Different kind of busy…

  • Before I got into IT work, I was working as a bank teller. Teller work isn’t much better than scanning barcodes. Maybe a little. You take in deposits, count cash, process withdrawals, and have the occasional cash shipment or drawer imbalance to keep it a little interesting.

    Tellering was a dead-end, so I ended up at Novell school, studying to be a Certified Netware Associate. That worked for me. It took a few months of schooling and passing some Novell exams, but I landed a contract job doing LAN support for a small company. My IT career took off from there…that was…WOW…13 years ago now! Man, I’m gettin’ old!!

  • Man, I miss the Novell days… Good old NetWare 4.11. Western New York used to be a huge Novell area too :(

  • NetWare nostalgia? Maybe… Interesting to learn how you got into networking. It’s not quite so easy anymore. Despite Cisco’s insistence that there’s a skills shortage, this is a very hard business to break into these days.

    Interesting stuff about the 10Gbps rollout and the MDF. What would you need 10Gbps at the access layer for, though?

  • The first year of experience is always the hardest to get. After about two years of good hands-on it’s pretty easy to find a decent job. In LA a good network engineer would have to *try* to be unemployed for more than a month.

    As for 10G at the access layer, we’re rolling out new Sun hardware that does 10G. At least it has a 10G NIC; it will probably be a year or two before they start to utilize it, just like when 1G NICs came out on PCI buses that could only push something like 150Mb. We’ll be able to consolidate an entire rack of servers into one big one. If your fault domain can stand to be that big, it’s a solid win on the logistics side. What do you need that for? Media. Streaming lots and lots of media. Music, movies, pictures.

    On a side note, we’re starting architectural plans for the next-gen data center (when all of our current hardware gets devalued and we do forklift upgrades), and we’re assuming that in 5 years 10G server uplinks will be commonplace.

  • Uh wow… Loved your myspace page, Keith… I never thought I’d see Peter North, Carl Sagan and Jim Henson as collective heroes for anyone before. :-) But I guess there’s a first for everything.

  • What would you need 10Gbps at the access layer for, though?

    Bladecenters, baby. Take a bunch of 1U pizza box servers and throw them into a big ol’ bladecenter instead. Frees up power, BTUs, and RUs.

  • Yeah, that’s true, Ethan… We’re in the area of Network Virtualization, so the speed per NIC has to make up for it. For that reason, we should probably see copper 10G NICs pretty soon.

  • For the RMON problem, you can use informs instead of traps. They need acknowledgment from the SNMP manager, so you get a more reliable method of delivering SNMP data. Just watch out for any extra CPU/memory load they might add to your router.

  • I think this is a case where you need to do internal monitoring and external probing. We have batted around the idea of RMON for things like more frequent CPU polling, at least in the short-term to get a look at patterns. But of course if someone’s power maintenance takes down your PDU (after they assure you it won’t…) you won’t be getting a trap. Periodic polling, at least with a simple ping every few minutes, is easy to set up, if a pain in the butt to maintain.

    The high utilization is a tough call. Personally I’d rather throw a polling box up. Solarwinds is dirt cheap and it reports on interface utilization. Not the most hardcore solution, but it’s cheap, fast and easy to use without being duct tape.

Ethan Banks is a podcaster and writer with a BSCS and 20+ years in enterprise IT. He's operated data centers with a special focus on infrastructure — especially networking. He's been a CNE, MCSE, CEH, CCNA, CCNP, CCSP, and CCIE R&S #20655. He's the co-founder of Packet Pushers Interactive, LLC where he creates content for humans in the hot aisle.
