Archive for May, 2009

random problem in your cloud-hosted app? try a new instance!

Tuesday, May 26th, 2009

chalk this one up under ‘Time Sunk’.

my Ambience for the Masses app is a Spring / Hibernate / JSP stack, with a couple of other sweet components. i run it on an AWS instance. it’s been purring along just wonderfully for months now. then, about ten days ago, it just stopped working

normally Java apps don’t die without throwing some sort of Exception. but that’s just what was happening. so i stripped out various components — thank you, Dependency Injection pattern! — and found that it would sometimes die instantly (if i was lucky) but usually it took a couple of hours. i don’t have a lot of free time to track down random intermittent bullshit like this, so it took me about a week to boil it down

it was somewhere in the Current Listener Map — my Shoutcast-listener-tracking geo-positioning statistics-gathering data sculpture back-end engine. i hear that the kids call them things mash-ups. the geo-lookup APIs were the most delicate part, and it seemed to work with them omitted. that red herring aside, it turned out to be the IP address resolution

that block was just a couple of lines until i’d added all the logging and desperate Exception handling. the app launches a lot of threads, so tracking down the issue was annoying … but eventually there it was in the traces. the stack just terminated when the app tried to getHostAddress (not during getByName though … must be a lazy-loading thing)

so i nearly had it all tracked down to that. then Tomcat inexplicably became unable to find basic JARs in /usr/share/java — i was using Fedora 8’s RPM version vs. raw Apache, and it’s organized real funny-like

so i threw up my hands and started up a fresh instance of my webapp AWS image. i’d rebooted the existing one, and that hadn’t helped at all. of course, the issue magically disappeared on the new instance. did anyone see that coming from a distance? ya probably did. cuz it’s ironic. and it’s the title line of the damn blog post

the hostname resolution is surely a low-level OS thing. both Linux JavaSE 6 and IcedTea 7 just shat the bed when they got to that point, which seems unlikely unless they both leveraged the same lib call. something must have gone wonk in the virtualization, and apparently a key part of the solution was running it inside of a different farm. i wasted a helluva lot of time to find that out

lesson !! if weird inexplicable freaky-ass things start happening to your cloud-hosted app, load it up on a new VM sooner rather than later. i’d taken a late-stage backup of the failed instance and assumed it would be corrupted with the Mystery Bug (read as: a waste of time to attempt). but oh no, it worked just great :) . next time it’ll be a cinch to just bundle the instance up, image it, and use it to launch a new one

and ultimately … it wasn’t a bug in my code !!!

being too rapid on the things that matter

Thursday, May 7th, 2009

it took me a while to come up with the title for this post. and it’s an Opinion Piece, not Technical … so you’ll see why …

i’m working for a new company now, and they’re rocking it for RoR apps on the iPhone. sounds like a good place to be. one of the many reasons why this position works for me is because these guys are all about GTD and getting it out there. lean ‘n’ mean

whereas i’ve become very used to a holistic, detail-oriented, wizened, test-backed process. great for Enterprise, but not so much for the reckless streets of Startup 3.0 . so i’m in a learning process. i’ve turned around some good stuff quickly, and it’s very satisfying

but i’ve screwed the pooch twice since i’ve been there. it’s totally a judgement call thing — i’m shooting too fast from the hip, and don’t feel like i really grasp the balance here …

first project i worked on was related to account management. they wanted a quick turn-around, i gave it a shot, had the whole thing backed with solid testing, and ready for on-time deployment with a smile. and in trying to keep track of all the new system permutations — i’d been there 2 weeks or so — i forgot one basic thing, and forgot to test for another. a nice little Perfect Storm. one emergency 1am database rollback later, we had a load of pissed customers and a helluva lot of explaining to do

so, then this past week, i went in to fix a minor rounding bug. those can be touchy. the right way to do it is with BigDecimal. yep, i’ve done that in Java too with BigDecimal. overall, it’s somewhat ponderous, detail-oriented, and can easily be polluted with Floats and the like. so i’d taken a shortcut, realizing that the low-level C impl was doing String conversion without the rounding issue. so i took the low-hanging fruit:

total.to_s.to_i

awesome !!1!. well, that is until you get into the 100-of-trillions area, otherwise shown as 1.0e+14. guess what happens when you parse that into a Fixnum? no database rollback this time, but Da Boss had to spend days sorting out the visceral impact of ridiculous sums of bogus exploit money pouring into our RPG
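
for the curious, here's a minimal sketch of what goes wrong. the helper names are made up for illustration, and this is not the code from our app. String#to_i stops reading at the first character that can't be part of an integer, so once the Float prints itself in scientific notation, everything after the leading digit gets dropped. the exact size at which Float#to_s flips over to scientific notation depends on your Ruby version, so the sketch feeds in the string form directly. the second helper is one possible alternative (an assumption on my part, not what we shipped): let BigDecimal parse the exponent, then truncate.

```ruby
require 'bigdecimal'

# hypothetical helper: the shortcut from the post.
# String#to_i stops at the first character that can't be part of an integer,
# so "1.0e+14" parses as just 1.
def shortcut_to_i(total)
  total.to_s.to_i
end

# hypothetical alternative: BigDecimal understands the exponent,
# and BigDecimal#to_i truncates toward zero.
def decimal_to_i(total)
  BigDecimal(total.to_s).to_i
end

puts shortcut_to_i(1234.56)    # 1234
puts shortcut_to_i("1.0e+14")  # 1
puts decimal_to_i("1.0e+14")   # 100000000000000
```

note that the shortcut behaves fine on everyday values, which is exactly why my test scripts never caught it.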

security, privacy and account management. payment calculations. not the sort of things to take shortcuts on. yet, if you’re embracing a culture that wants it done quickly and with minimum impact, it’s a risk you might be willing to take. it’s not like i didn’t have test scripts … i just forgot to head into scientific notation territory. just like i forgot to check for the implication of null password acceptance ( long story there, special account cases, etc. )

i’m putting these things up here for my fellow developers to laugh at. “I mean, c’mon. All that’s totally obvious stuff.” “I’d never miss that, that’s sophomore shit.” good, get it out of your system, laughing boy

but believe me, when you’re on the other end of it, and had been in the middle of it and all full of all the other things that you needed to keep track of at that time, heh, well, that’s when you’ll really need to keep yerself laughing :)