CCIE Week 1 – Core Crash


I went back to work last Thursday. The day started out as a normal day. Plowing through endless piles of e-mail. Trying to figure out what projects I should actually be working on. Reviewing change controls. After a while, a little problem comes up with a DSL line at one of our buildings, so I head over to take a look at it. Not 30 minutes away from my desk, my cell goes off with an “all hands on deck” call because of a problem in one of our data centers.

I speed-walk back across the campus to my desk and hop on the conference bridge. One of our core 6500s took a nosedive, and not a graceful one. We run dual-core 6500s, and each 6500 has dual sups. We still don’t know exactly what happened, but one of the sups on one of the 6500s went into a death spiral – yet not badly enough to trigger an SSO/NSF failover to the hot standby sup. The failover did eventually happen, but it took about 7 minutes. During those 7 minutes, life was ugly: EIGRP adjacencies flipping out, HSRP groups confused, firewall HA clusters losing their minds, etc. But after the sup finally failed over, traffic began forwarding normally – mostly. Something about the crash took a WS-X6548 card down, too. The card showed Ethernet link for everything connected, but was only incrementing inOctets – no outOctets. That caused more grief for the firewalls, because the HA between them was still confused. Asymmetric routing through the firewalls resulted, and oof…what a mess. Once the misbehaving WS-X6548 card was identified and the firewall Ethernet connections moved to a different card, life settled back down.
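Side note: the symptom that finally fingered the bad card (link up, inOctets climbing, outOctets flat) is easy enough to watch for proactively. Here’s a rough monitoring sketch, not something we actually run: it samples the two standard IF-MIB counters twice using the net-snmp snmpget tool and flags a receive-only port. The hostname, community string, ifIndex, and polling interval are all placeholder assumptions.

```python
#!/usr/bin/env python3
"""Rough sketch: detect a "half-dead" port that shows link and receives
traffic but never transmits, by sampling IF-MIB counters twice.
Assumes the net-snmp 'snmpget' CLI is installed; the hostname, community
string, ifIndex, and interval below are placeholders, not real values."""

import subprocess
import time

HOST = "core-switch.example.net"   # placeholder switch hostname
COMMUNITY = "public"               # placeholder SNMP community string
IF_INDEX = 42                      # placeholder ifIndex of the suspect port
INTERVAL = 60                      # seconds between counter samples

# Standard IF-MIB octet counters (32-bit).
IF_IN_OCTETS = "1.3.6.1.2.1.2.2.1.10"
IF_OUT_OCTETS = "1.3.6.1.2.1.2.2.1.16"


def get_counter(oid: str) -> int:
    """Fetch one counter value with snmpget (-Oqv prints the value only)."""
    out = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, f"{oid}.{IF_INDEX}"],
        text=True,
    )
    return int(out.strip())


def main() -> None:
    in1, out1 = get_counter(IF_IN_OCTETS), get_counter(IF_OUT_OCTETS)
    time.sleep(INTERVAL)
    in2, out2 = get_counter(IF_IN_OCTETS), get_counter(IF_OUT_OCTETS)

    rx_delta = (in2 - in1) % 2**32   # tolerate 32-bit counter wrap
    tx_delta = (out2 - out1) % 2**32

    if rx_delta > 0 and tx_delta == 0:
        print(f"ifIndex {IF_INDEX}: received {rx_delta} octets but sent none "
              "-- possible half-dead line card port")
    else:
        print(f"ifIndex {IF_INDEX}: rx {rx_delta} / tx {tx_delta} octets, looks sane")


if __name__ == "__main__":
    main()
```

Pointing something like that at each port on a suspect card would separate “card is forwarding” from “card is only listening” a lot faster than waiting for the firewalls to tell you.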

I volunteered to open the TAC case to try to figure out what happened. I looked for the crashinfo ahead of time with a good ol’ “dir all”, but didn’t see anything. I posted a “show tech” for the TAC engineer, hoping that would help. No luck. Without the crashinfo file (no one can explain why it wasn’t created), we have no idea why the sup took that nasty nosedive…or why the failover took about 7 minutes.

We do know that the IOS we’re on is deferred. Obviously, it wasn’t deferred at the time the 6500s went into production – but it is now. We don’t know why it’s deferred (TAC is researching that for me, since the deferral notice for our particular IOS is no longer publicly available on cisco.com), but deferred usually means Not Suitable For Production. An IOS upgrade was on the board for this year anyway, since we need to move to a newer release to support the 10 gig cards we’re installing. I’ve been hoping that something in the 12.2(33)SXH family gets certified by the Safe Harbor program, since there are a lot of advanced IOS features in there I’d like to have at my disposal on my core cats.

So my first day back at work was not a gentle easing back into things. Nope – got thrown to the wolves right on day one. <sigh> The kind of outage we experienced certainly would have hurt less if we had the staffing levels to do more design and less implementation. On my team, we’re usually so busy making stuff go that we rarely get the chance to do design review and get improvements on the board. We get to do redesign when Something Really Bad happens and we’re forced to react. That’s life, I guess…but it means my life between now and our peak season (the holidays) is probably going to be a smidge more stressful than I might have liked.

Comments

  • This may be a good time to mention that our “line rate” 10G ports on our 6704s have been dropping outbound packets starting at about 3–3.5 Gbps. Cisco’s best guess is that the buffers on that card are too shallow for bursty traffic, and we’re pushing mostly movies and music. And if you deploy 10 SR Xenpaks, expect at least one to die every 6 months. A bad Xenpak is the first thing we check whenever a port goes down now, and it’s usually the problem.

  • SSO/NSF is a very good idea in theory, but in my experience across different kinds of equipment, 90% of the time it doesn’t work the way it should.
    Cisco keeps improving it, but I still don’t feel confident enough to rely on it.

    We mostly use redundant chassis (and L2/L3 mechanisms) for such scenarios.

  • Yep. We do the same L2/L3 redundancy as well. The issue was really one of a sup “sort of” dying, but not completely dying. We would have been better off if the scenario had been something like a complete power failure to the chassis (essentially impossible in our environment, but let’s just say). But with the sup “half up” for those 7 minutes before SSO did what it was supposed to, life wasn’t good on that one chassis.

  • Do check the RAM.
    There is a known issue involving sup engines, RAM, and certain IOS versions. Funny thing is, all three have to match (a sup before a certain serial number, an IOS before a certain version, and the RAM in question), but surprisingly it happens very often.

  • Half-dying cards are the worst type of failure. We had that happen to us once on a Foundry switch. Have you experienced any Qos3Outlost drops on your 6548 cards? It’s due to the 1 Gbps-per-8-ports ASIC limitation…you might want to check whether you have a bunch of servers using those line cards for connectivity.
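The two comments above about drops (shallow 6704 buffers under bursty load, and the 6548’s shared per-ASIC bandwidth) are easier to reason about with some rough numbers. Here’s a back-of-the-envelope sketch; the port counts, offered loads, burst rates, and the 1 MB buffer figure are all made-up illustrative assumptions, not measurements from this network.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope math for the drop scenarios in the comments above.
All of the port counts, offered loads, and the buffer size are illustrative
assumptions, not measured values from this network."""

GBPS = 1_000_000_000  # bits per second

# --- WS-X6548-style shared ASIC bandwidth --------------------------------
# Assumption: 8 gigabit ports in one port group share roughly 1 Gbps toward
# the switch fabric, and six busy servers in that group average 300 Mbps each.
ports_per_group = 8
shared_bandwidth_bps = 1 * GBPS
offered_load_bps = 6 * 300_000_000

oversub_ratio = (ports_per_group * GBPS) / shared_bandwidth_bps
print(f"worst-case oversubscription: {oversub_ratio:.0f}:1")
if offered_load_bps > shared_bandwidth_bps:
    print(f"offered {offered_load_bps / GBPS:.1f} Gbps > shared "
          f"{shared_bandwidth_bps / GBPS:.1f} Gbps -> expect output drops "
          "(the Qos3Outlost counters mentioned above)")

# --- Shallow egress buffer on a bursty 10G port ---------------------------
# Assumption: a hypothetical 1 MB egress buffer draining at 10 Gbps while a
# multi-source burst arrives from the fabric at 20 Gbps.
buffer_bytes = 1_000_000
arrival_bps = 20 * GBPS
drain_bps = 10 * GBPS

fill_time_ms = (buffer_bytes * 8) / (arrival_bps - drain_bps) * 1_000
print(f"a {buffer_bytes // 1_000_000} MB buffer absorbs that burst for only "
      f"~{fill_time_ms:.1f} ms before tail drops begin")
```

The point isn’t the exact figures; it’s that an oversubscribed port group or a shallow buffer leaves only a tiny cushion before drops start, even when average utilization looks modest.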

Ethan Banks is a podcaster and writer with a BSCS and 20+ years in enterprise IT. He's operated data centers with a special focus on infrastructure — especially networking. He's been a CNE, MCSE, CEH, CCNA, CCNP, CCSP, and CCIE R&S #20655. He's the co-founder of Packet Pushers Interactive, LLC where he creates content for humans in the hot aisle.
