A Weekend of Craft, Theatre, and Technical Meltdowns

… and just to preface things – this is not a boastful blog entry. Everything I did in the technical realm was either (a) a simple fix or (b) being helpful – nothing to brag about. It’s the circumstances that make it something I’d like to put down on record.

So, Friday night I was stitching together the last parts of my Burning Man coat. It’s made of fur, and ridiculous by design. I’m adding some needed collar reinforcement, when suddenly I start getting Prowl notifications. My health checks are failing. “Ah, crap, not again,” says the guy who’s used to running a totally-non-critical app platform in the AWS cloud, “I’ll get to it after I’ve finished sewing buffalo teeth into the collar.” So I did.

My instance’s CPU appeared to be spiked – I could hit it with ssh, but the connection would time out. A reboot signal resolved the issue (after an overnight wait). And it was thus that I fell victim, like so many others, to Amazon’s ThunderCloudPocalypse 2012. And the secret bonus was that one of my EBS volumes was stuck in the “attaching” state. “Ah, crap, not again,” says the guy who’s gonna lose some data (because he has backup scripts for Capistrano but no automation for them yet), and I’m forced to spin up a new volume from a month-old snapshot. No worries – it wasn’t my MySQL / MongoDB volume, just the one for my blog & wiki & logs. I got that up and running on Saturday in between rehearsing scenes for The Princess Bride (coming to The Dark Room in August 2012!!).

Then I was immediately off to rehearsal for my Dinner Detective show that night. Yeah, it was one of those kinds of Saturdays. So, I was sitting there waiting for my cue, when at about 5pm PDT, failure txts suddenly start raining down from work. And from multiple servers that have no reason to have load problems. I log into our Engineering channel via the HipChat iPhone app, and our DevOps hero is already on the case.

ElasticSearch has pegged the CPU on its server, and JIRA & Confluence are going nuts as well. Something’s suddenly up with our Java-based services. I ask him to check on Jenkins, and sure enough, it’s pegged too. And no one’s pushed anything to build. He goes off to reboot services and experiment, and I go off to check Twitter to see if we’re the only ones experiencing it. Sudden JVM failures distributed across independent servers? That’s unlikely. He guesses it’s a problem with date calculation, and he was absolutely right.

Hello, leap second, the one added at midnight GMT on July 1st, 2012. I RT’d a few good informative posts to get the word out – what else can I do, I’m at rehearsal and on my phone! – and then let DevOps know. We were able to bring down search for a while, and it turns out rebooting the servers solves the problem (even without disabling ntpd, as other folks recommended). So, disaster averted thanks to Nagios alerts, a bit of heroic effort, and our architect’s choice of a heavily Ruby-based platform stack.
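
For the curious, here is a minimal Python sketch of why one extra second is awkward for software at all. It does not reproduce whatever pegged our JVMs; it only shows that the leap second (written 23:59:60 UTC on June 30th) has no slot of its own in POSIX time, and that Python's datetime refuses to represent it. The helper names `posix_seconds` and `fits_in_datetime` are made up for this illustration.

```python
import calendar
from datetime import datetime, timezone

# The inserted second, written as UTC fields: 2012-06-30 23:59:60.
LEAP = (2012, 6, 30, 23, 59, 60)
# The first "normal" second after it: 2012-07-01 00:00:00.
NEXT_MIDNIGHT = (2012, 7, 1, 0, 0, 0)


def posix_seconds(fields):
    """Convert UTC (year, month, day, hour, minute, second) to POSIX seconds.

    calendar.timegm does plain arithmetic and does not range-check the
    seconds field, so second=60 simply rolls over into the next minute.
    """
    return calendar.timegm(fields)


def fits_in_datetime(fields):
    """Return True if datetime accepts these fields, False if it rejects them."""
    try:
        datetime(*fields, tzinfo=timezone.utc)
        return True
    except ValueError:
        # datetime only allows seconds in the range 0..59.
        return False


if __name__ == "__main__":
    # Two different wall-clock labels, one POSIX timestamp.
    print(posix_seconds(LEAP) == posix_seconds(NEXT_MIDNIGHT))
    # The leap second itself is not a valid datetime.
    print(fits_in_datetime(LEAP))
```

Two distinct moments end up sharing a single timestamp, so something downstream has to repeat or smear a second. That is exactly the kind of assumption-breaking edge case that code written around "a day is 86,400 seconds" never expects.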

Again, as I prefaced: nothing impressive. No Rockstar Ninja moves. No brilliant deductions or deep insightful inspections. Neither lives nor fortunes were saved, and I got to wake up on Sunday, do laundry, pay my bills, and go out dancing to Silent Frisco for the later hours of the afternoon. But it was fun to have been caught up in two different reminders of how fragile our amazing modern software is, and how the simplest unexpected things – storms in Virginia, and Earth’s pesky rotation – can have such sudden, pervasive, quake-like impacts on it.
