I work in software development. I’m a “Senior Software Project Engineer,” which means that I work with other people to define their needs (and wants) and then lead a team that designs, architects, and implements a solution. As I’ve moved up over the years I’ve worked with a bunch of different people, many of whom are experts in their area. Recently, I got a bit of a verbal lashing from one of these people, who is the head of IT Operations. This person almost never says anything nicely, but they almost always end up being right and they’re really good at what they do, so I try to ignore how they say things and focus on the message instead. So what were they right about this time?
Our logging sucks. We log all kinds of stuff. Some of the errors we log are legit. But many, if not most, are total garbage. Some of the “errors” are not really errors (that’s another blog post). But what about those that are truly errors? What could be wrong with logging them?
Imagine an error that says “Fatal error during processing.”
Ok. Now what in the world does that mean? And what should I do about it? Can I just rerun processing? Do I have to do some kind of cleanup first? Should I report it to someone? Was the problem related to the software logic, to the environment (disk space, network down, etc.), or to the input? What the hell am I, the guy whose job it is to make sure work gets done, supposed to do with that error? I suppose I’ll probably just try to run it again and cross my fingers, right? Well, I’m only doing that until someone’s software doesn’t clean up after itself (that’s also another blog post) and causes havoc by being restarted… then I’m out of the business of trying to be helpful and I’m in the business of complaining.
Let’s re-imagine that the same error now says “Fatal error during processing – insufficient disk space available”. That’s better, right?! Sure it is. I’d much rather have that! But it still hasn’t answered over half of this operator’s questions. Can I just rerun? Do I have to clean up some runtime data first? It’s better, but still not completely helpful.
Trying again. “Fatal error during processing – insufficient disk space available. Process requires at least 1GB available disk space on volume /server/vol1 to run. Create necessary disk space and rerun.”
Now we’re cooking! All the information anybody needs is right there in the log. Happy Ops people! And honestly, if you’re DevOps, you probably care about this even more because it’ll be you trying to figure out what went wrong. Logs are important; good logs are a godsend.
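To make that concrete, here is a rough sketch of how a message like the last one might be produced. It is only an illustration, not code from any real system: the function names, the "processing" logger name, and the idea of checking free space up front are all assumptions of mine. The volume path and the 1GB figure come from the example message above. I'm using Python's standard logging module and shutil.disk_usage.

```python
import logging
import shutil

# Hypothetical logger name, chosen only for this example.
logger = logging.getLogger("processing")

# 1 GB, the requirement quoted in the example message.
REQUIRED_BYTES = 1024**3


def build_disk_space_error(volume: str, required_bytes: int, free_bytes: int) -> str:
    """Compose a message that says what failed, why, and what to do next."""
    required_gb = required_bytes / 1024**3
    free_gb = free_bytes / 1024**3
    return (
        "Fatal error during processing - insufficient disk space available. "
        f"Process requires at least {required_gb:.0f}GB available disk space "
        f"on volume {volume} to run ({free_gb:.2f}GB currently free). "
        "Create necessary disk space and rerun."
    )


def check_disk_space(volume: str, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the volume has enough free space; log an actionable error if not."""
    free_bytes = shutil.disk_usage(volume).free
    if free_bytes < required_bytes:
        logger.error(build_disk_space_error(volume, required_bytes, free_bytes))
        return False
    return True
```

The one thing I added beyond the message in the post is the amount of space currently free, on the theory that the operator will want to know how much they need to clear. The point is the same either way: the code that detects the problem already knows the volume, the requirement, and whether a rerun is safe, so it costs very little to write all of that into the log.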
So next time you’re fixing an issue and you’re digging through code to try to find the cause of a problem – that you only know about because of a log – keep in mind that most of that digging could be avoided by better log messages. Take the time to update your log messages while you’re in there rather than just fixing the bug. You’ll be glad you did and you’ll make life better for you and for someone else!