… and just to preface things – this is not a boastful blog entry. Everything I did in the technical realm was either (a) a simple fix or (b) being helpful – nothing to brag about. It’s the circumstances that make it something I’d like to put down on record.
So, Friday night I was stitching together the last parts of my Burning Man coat. It’s made of fur, and ridiculous by design. I’m adding some needed collar reinforcement, when suddenly I start getting Prowl notifications. My health checks are failing. “Ah, crap, not again,” says the guy who’s used to running a totally-non-critical app platform in the AWS cloud, “I’ll get to it after I’ve finished sewing buffalo teeth into the collar.” So I did.
My instance’s CPU appeared to be spiked – I could hit it with ssh, but the connection would time out. A reboot signal resolved the issue (after an overnight wait). And it was thus that I fell victim, like so many others, to Amazon’s ThunderCloudPocalypse 2012. And the secret bonus was that one of my EBS volumes was stuck in the “attaching” state. “Ah, crap, not again,” says the guy who’s gonna lose some data (because he has backup scripts for Capistrano but no automation for them yet), and I’m forced to spin up a new volume from a month-old snapshot. No worries - it wasn’t my MySQL / MongoDB volume, just the one for my blog & wiki & logs. I got that up and running on Saturday in between rehearsing scenes for The Princess Bride (coming to The Dark Room in August 2012!!).
Then I was immediately off to rehearsal for my Dinner Detective show that night. Yeah, it was one of those kinds of Saturdays. So, I was sitting there waiting for my cue, when at about 5pm PDT, failure txts suddenly start raining down from work. And from multiple servers that have no reason to have load problems. I log into our Engineering channel via the HipChat iPhone app, and our DevOps hero is already on the case.
ElasticSearch has pegged the CPU on its server, and JIRA & Confluence are going nuts as well. Something’s suddenly up with our Java-based services. I ask him to check on Jenkins, and sure enough, it’s pegged too. And no one’s pushed anything to build. He goes off to reboot services and experiment, and I go off to check Twitter to see if we’re the only ones experiencing it. Sudden JVM failures distributed across independent servers? That’s unlikely. He guesses it’s a problem with date calculation, and he was absolutely right.
Hello leap-second, the one added at midnight GMT on July 1st, 2012. I RT:d a few good informative posts to get the word out – what else can I do, I’m at rehearsal and on my phone! – and then let DevOps know. We were able to bring down search for a while, and it turns out rebooting the servers solves the problem (even without disabling ntpd, as other folks recommended). So, disaster averted thanks to Nagios alerts, a bit of heroic effort, and our architect’s choice of a heavily Ruby-based platform stack.
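For the curious, here is a minimal Python sketch of why a leap second is awkward for software at all. It only illustrates the clock arithmetic, assuming ordinary POSIX-style timekeeping; it is not the bug that pegged our CPUs, and the helper names (posix_seconds, datetime_accepts) are made up for this example.

```python
import calendar
from datetime import datetime

# The leap second itself, written as a UTC time tuple: 2012-06-30 23:59:60.
LEAP_SECOND = (2012, 6, 30, 23, 59, 60)
# The very next second on the UTC calendar: 2012-07-01 00:00:00.
NEXT_MIDNIGHT = (2012, 7, 1, 0, 0, 0)


def posix_seconds(utc_tuple):
    """Convert a (year, month, day, hour, minute, second) UTC tuple
    to seconds since the Unix epoch (hypothetical helper)."""
    return calendar.timegm(utc_tuple)


def datetime_accepts(utc_tuple):
    """Report whether datetime can represent this wall-clock time
    (hypothetical helper)."""
    try:
        datetime(*utc_tuple)
        return True
    except ValueError:
        return False


if __name__ == "__main__":
    # POSIX time has no slot for second 60, so both moments
    # collapse onto the same epoch value.
    print(posix_seconds(LEAP_SECOND), posix_seconds(NEXT_MIDNIGHT))
    # datetime only allows seconds 0..59, so it rejects the leap second.
    print(datetime_accepts(LEAP_SECOND), datetime_accepts(NEXT_MIDNIGHT))
```

In other words, the extra second has nowhere to live in the usual time representations, so the system clock has to repeat or step a second to absorb it, and code that assumes time never does that can get confused.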
Again, as I prefaced: nothing impressive. No Rockstar Ninja moves. No brilliant deductions or deep insightful inspections. Neither lives nor fortunes were saved, and I got to wake up on Sunday, do laundry, pay my bills, and go out dancing to Silent Frisco for the later hours of the afternoon. But it was fun to have been caught up in two different reminders of how fragile our amazing modern software is, and how the simplest unexpected things – storms in Virginia, and Earth’s pesky, not-quite-steady rotation – can have such sudden, pervasive, quake-like impacts on it.