I went back to work last Thursday. The day started out as a normal day. Plowing through endless piles of e-mail. Trying to figure out what projects I should actually be working on. Reviewing change controls. After a while, a little problem comes up with a DSL line at one of our buildings, so I head over to take a look at it. Not 30 minutes after I leave my desk, my cell goes off with an “all hands on deck” call because of a problem in one of our data centers.
I speed-walk back across the campus to my desk, and hop on the conference bridge. One of our core 6500s took a nosedive, and not a graceful one. We run a pair of core 6500s, and each 6500 has dual sups. We still don’t know exactly what happened. One of the sups on one of the 6500s went into a death spiral, just not badly enough to trigger an SSO/NSF failover to the hot standby sup. The failover did eventually happen, but it took about 7 minutes. During those 7 minutes, life was ugly. EIGRP adjacencies flipping out, HSRP groups confused, firewall HA clusters losing their minds, etc. After the sup finally failed over, traffic began forwarding normally – mostly. Something about the crash took a WS-X6548 card down, too. The card showed Ethernet link for everything connected, but was only showing inOctets – no outOctets. That caused some firewalls to lose their minds, because the HA between the firewalls was still all confused. Asymmetric routing through the firewalls resulted, and oof…what a mess. Once the badly behaving WS-X6548 card was identified and the firewall Ethernet connections moved to a different card, life settled back down.
I volunteered to open the TAC case to try to figure out what happened. I looked for the crashinfo ahead of time with a good ol’ “dir all”, but didn’t see anything. I posted a “show tech” for the TAC engineer, hoping that would help. No luck. Without the crashinfo file (and no one can explain why it wasn’t created), we have no idea why the sup took that nasty nosedive…or why it took about 7 minutes for the failover to occur.
We do know that the IOS we’re on is deferred. Obviously, it wasn’t deferred at the time the 6500s went into production – but it is now. We don’t know why it’s deferred (TAC is researching that for me, since the deferral notice for our particular IOS is not publicly available on cisco.com anymore), but deferred usually means Not Suitable For Production. An IOS upgrade was on the board for this year anyway, since we need to go higher to support the 10 gig cards we’re installing. I’ve been hoping that something in the 12.2(33)SXH family gets certified by the Safe Harbor program, since there are a lot of advanced IOS features in there I’d like to have at my disposal on my core cats.
So my first day back at work was not a gentle easing back into things. Nope – got thrown to the wolves right on day one. <sigh> The kind of outage we experienced certainly would have hurt less if we had the staffing levels to do more design and less implementation. On my team, we’re usually so busy making stuff go that we rarely get the chance to do design review and get improvements on the board. We get to do redesign when Something Really Bad happens, and we’re forced to react. That’s life, I guess…but it means my life between now and our peak season (the holidays) is probably going to be a smidge more stressful than I might have liked.