random problem in your cloud-hosted app? try a new instance!

chalk this one up under ‘Time Sunk’.

my Ambience for the Masses app is a Spring / Hibernate / JSP stack, with a couple of other sweet components. i run it on an AWS instance. it’s been purring along just wonderfully for months now. then, about ten days ago, it just stopped working

normally Java apps don’t die without throwing some sort of Exception. but that’s just what was happening. so i stripped out various components — thank you, Dependency Injection pattern! — and found that it would sometimes die instantly (if i was lucky) but usually it took a couple of hours. i don’t have a lot of free time to track down random intermittent bullshit like this, so it took me about a week to boil it down

it was somewhere in the Current Listener Map — my Shoutcast-listener-tracking geo-positioning statistics-gathering data sculpture back-end engine. i hear that the kids call them things mash-ups. the geo-lookup APIs were the most delicate part, and it seemed to work with them omitted. that red herring aside, it turned out to be the IP address resolution

that block was just a couple of lines until i’d added all the logging and desperate Exception handling. the app launches a lot of threads, so tracking down the issue was annoying … but eventually there it was in the traces. the stack just terminated when the app tried to getHostAddress (not during getByName though … must be a lazy-loading thing)
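for the curious, here’s roughly the shape that kind of defensive wrapper can take. this is a minimal sketch, not my actual code: the class name `GuardedLookup`, the method `hostAddressOrFallback`, the timeout and the fallback string are all made up for illustration, and it uses lambdas, so it assumes Java 8 or later rather than the JavaSE 6 i was running. the idea is to push the lookup onto its own worker thread so that a hang or a thrown Error shows up in the logs instead of silently taking the calling thread with it

```java
import java.net.InetAddress;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not from the original app.
public class GuardedLookup {

    /**
     * Runs getByName + getHostAddress on a daemon worker thread.
     * Returns the fallback if the lookup fails or exceeds the timeout.
     */
    static String hostAddressOrFallback(String host, long timeoutMillis, String fallback) {
        ExecutorService pool = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "guarded-lookup");
            t.setDaemon(true); // a stuck lookup must not keep the JVM alive
            return t;
        });
        Future<String> result =
                pool.submit(() -> InetAddress.getByName(host).getHostAddress());
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true);
            System.err.println("lookup timed out for " + host);
            return fallback;
        } catch (ExecutionException e) {
            // anything thrown inside the worker (Exception or Error) lands here
            System.err.println("lookup failed for " + host + ": " + e.getCause());
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return fallback;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // a literal IP needs no DNS round trip, so this runs offline
        System.out.println(hostAddressOrFallback("127.0.0.1", 2000, "unknown"));
    }
}
```

caveat: if the JVM itself falls over inside native code, no amount of try / catch will save you. a wrapper like this only helps with hangs and with things that actually get thrown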

so i nearly had it all tracked down to that. then Tomcat inexplicably became unable to find basic JARs in /usr/share/java — i was using Fedora 8’s RPM version vs. raw Apache, and it’s organized real funny-like

so i threw up my hands and started up a fresh instance of my webapp AWS image. i’d rebooted the existing one, and that hadn’t helped at all. of course, the issue magically disappeared on the new instance. did anyone see that coming from a distance? ya probably did. cuz it’s ironic. and it’s the title line of the damn blog post

the hostname resolution is surely a low-level OS thing. both Linux JavaSE 6 and IcedTea 7 just shat the bed when they got to that point, which is unlikely unless they both leveraged the same lib call. something must have gone wonk in the virtualization, and apparently a key part of the solution was running it inside of a different farm. i wasted a helluva lot of time to find that out

lesson !! if weird inexplicable freaky-ass things start happening to your cloud-hosted app, load it up on a new VM sooner rather than later. i’d taken a late-stage backup of the failed instance and assumed it would be corrupted with the Mystery Bug (read as: a waste of time to attempt). but oh no, it worked just great :) . next time it’ll be a cinch to just bundle the instance up, image it, and use it to launch a new one

and ultimately … it wasn’t a bug in my code !!!
