Friday, September 18, 2009

The Triumph of Bureaucracy


I know I'd promised to finish up my mini-series on offshoring, but I like to be 'agile' with this blog. If a subject comes up in my day-to-day life that seems worthy of blogging, I'd rather cover that and return to the more calculated subjects later. In any event, I've had a post about bureaucracy on my backlog for some time now. That entry is still to come, but today I will share a minor case study in the subject while it still has some immediacy to me. My own version of thedailywtf, if you will.

I work for a company that does regular work for cell phone carriers (most of the major ones in the US). Our software is used by millions and millions of people a day, and we generally have very rigorous SLAs around uptime, etc. There are stiff penalties (I'm talking hundreds of thousands of dollars) if we fail to meet them.

Understandably, this has led to quite a bit of process around our releases. Typically, the PGM for a given project puts together a SIS (short interval schedule) for the release of code. A typical high-level schedule might look something like:


  1. File formal RFC (Request for Change) to internal teams and carrier

  2. Create SIS (short interval schedule) detailing timeline and steps of release, go/no go criteria, and latest time at which a rollback can occur

  3. Formal review of the SIS with development team, as well as technician, release readiness analyst, and tier 3 support

The releases themselves always occur within predetermined carrier maintenance windows, which differ slightly between carriers but typically fall on one or two days each week, from about 11pm to 3am. The release itself usually looks something like this:



  1. Send formal notification to carrier

  2. Check that all production monitors are in green

  3. Disable monitors

  4. Pull a subset of boxes out of a given VIP (for a subcomponent) and update them with new code.

  5. Return the updated boxes to the VIP and remove the other boxes from this same VIP to complete the update.

  6. Repeat steps 4-5 for each affected subcomponent

  7. QA smoke test

  8. Go/no go

  9. Send formal notification of completion to carrier
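For the curious, steps 4-6 amount to a rolling update. Here is a minimal sketch in Python of how that ordering works. To be clear, this is not our actual tooling: the `Vip` class, the `rolling_update` function, and the box names are all hypothetical, and the "update" is just a version string changing in memory.

```python
from dataclasses import dataclass, field


@dataclass
class Vip:
    """Hypothetical stand-in for a load-balanced pool (VIP) of boxes."""
    name: str
    in_rotation: set
    versions: dict = field(default_factory=dict)


def rolling_update(vip, new_version):
    """Update every box behind a VIP in two batches (steps 4-5).

    Returns a log of the actions taken, in order.
    """
    boxes = sorted(vip.in_rotation)
    half = max(1, len(boxes) // 2)
    log = []
    for batch in (boxes[:half], boxes[half:]):
        if not batch:
            continue
        # Pull this batch out of the VIP so it stops taking traffic.
        vip.in_rotation -= set(batch)
        log.append(f"{vip.name}: removed {batch}")
        # Stand-in for deploying the new code to each box.
        for box in batch:
            vip.versions[box] = new_version
        log.append(f"{vip.name}: updated {batch} to {new_version}")
        # Return the updated boxes before touching the rest.
        vip.in_rotation |= set(batch)
        log.append(f"{vip.name}: returned {batch}")
    return log


def release(vips, new_version):
    """Repeat the two-batch update for each affected subcomponent (step 6)."""
    log = []
    for vip in vips:
        log.extend(rolling_update(vip, new_version))
    return log
```

With two or more boxes behind a VIP, at least one batch stays in rotation the whole time, which is the entire point of doing it in halves. With a single box the sketch simply takes that box out, so there would be a brief outage for that subcomponent.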

Obviously, this level of rigor is understandable when you're touching something that gets millions upon millions of hits per day. The problem is when it's not appropriate.


Recently, I've been involved in a minor update to a simple testing site we've provided to one of our carriers. It's a legacy app written in PHP and MySQL, whereas our production code mostly runs on .NET and SQL Server. It consists of two production boxes, with a MySQL instance running on one of them. It hasn't been updated in about 3 years, and the carrier provides it to its handset manufacturers to do a first-pass test on new handsets that are being released.


There is a trickle of requests per day (I'm talking like maybe 10 on a busy day) and it has little visibility within the carrier. There are a few contacts at the carrier who use it, but thus far they've been pretty amenable to anything we've suggested, as they haven't seen an update to this app in over three years. Our current project mainly consists of adding a few new tests and removing a couple that are no longer used.


We asked if we could release it during the day and they were completely agreeable to this, saying they just needed to notify their contacts at the device manufacturers. Seems like a no-brainer, huh?


Except our customer PGM team responded by stating that it's a production box and so it *has* to be released in a nighttime release window. We need a bunch of people to come in from 11pm to 3am. It made no difference that we could release it during the day without any impact to users, even if we *didn't* have the carrier send them an email saying 'Stay out of the system for an hour' (there are two boxes and only minor additions to some table structures that wouldn't impact the 4-5 end users).


So, instead of having an email sent, having people make some minor updates during the day, and doing the logical thing, we're going to have all the process and bureaucracy of the type of release we'd have if we were impacting millions of users. Why is that? Because people are incapable of thinking critically; they are rigid. Logic is immaterial to a bureaucracy. At times I feel like I'm stuck in the film "Brazil":



Sam Lowry: Excuse me, Dawson, can you put me through to Mr. Helpmann's office?
Dawson: I'm afraid I can't sir. You have to go through the proper channels.
Sam Lowry: And you can't tell me what the proper channels are, because that's classified information?
Dawson: I'm glad to see the Ministry's continuing its tradition of recruiting the brightest and best, sir.
Sam Lowry: Thank you, Dawson.


This is simply one example, but next time I hear someone ask why even simple things are so expensive, I'm going to have to bite my tongue. I don't even care to argue anymore.

4 comments:

Jeremy Walker said...

Sounds like a basic risk management exercise. xx% of operations carry a $yy penalty with a zz% chance of failure. Executing the same process regardless of actual risk reduces incidents by a%. a% * $yy penalty > cost to follow procedure when it's a total waste, so waste the time to maximize profit/minimize cost. Frustrating, sure, but logical in a perverse bureaucratic way.

Code Monkey said...

The thing is, this particular component has no enforced SLA penalty. Further, I guarantee no one has done such a methodical analysis of risk. Again, it's mere rigid application of some set of principles irrespective of whether or not they make sense within a given context.

And it drives me fucking insane.

Dantelope said...

Tell them you'll pay for any damages that occur to the client and their customers.

If you can say that, then you've got a point.

If you can't.... well.... see you at 11pm!

Code Monkey said...

@Dantelope

That's an interesting statement. A bit flippant perhaps, but I don't disagree, insofar as I do agree one should be willing to take responsibility for his actions and decisions. That said, you are suggesting I bear the monetary penalty for a decision made for the business for which I work.

If I am to directly pay the penalty for my mistakes, would it then not be reasonable to expect I would also directly profit from those decisions that generate revenue or save money for the company?

If you are arguing that I *only* pay penalties for mistakes, but do not profit from my potentially risky (but also potentially profitable) decisions, would I then not work to minimize my own risk and, in fact, make all decisions with the goal of minimizing risk? In fact, I would engage in my job so as to make as few decisions and do as little work as possible. If a company came to us with a new job that would land the company $100 million in revenue, but I knew there was a 10% chance we would fail to meet the deadline, and if it went wrong I would have to pay out of pocket for breaking the contract, what decision would you make?

Every decision one makes on behalf of the company for which he works has the potential to profit or harm the company, either directly or indirectly. Generally, however, compensation (other than for the sales force) is not directly based upon this. There is some indirect effect vis-a-vis things like profit sharing plans, or simply the fact that with continued bad decisions the company will no longer be able to pay my salary.

There is little point in working for a company if I directly profit or pay based on its performance. That situation is called being an 'entrepreneur', in which one has leverage and his individual decisions directly affect the profitability (or lack thereof) of the company for which he works.

In this instance, I made a few decisions going into this project that saved the company thousands of dollars. As soon as I'm written a check for those savings, I'll be happy to pay any SLA penalties incurred as a result of a daytime release of this system going awry.

Particularly because, as I wrote in my original post, there is no enforced SLA penalty for this component and its users agreed verbally that they could ask others to stop using it for a brief period during the day while we released the new code.