

Single Point of Failure

If you follow tech news, you may have heard about how badly the city of San Francisco screwed up with their network administrator. Although this is an unusually spectacular blowup, the conditions that created this situation are, sadly, replicated throughout the I.T. world. The underlying idea is something called the Bus Test. In essence, the Bus Test says that the overall system should survive if any one of the people closest to it is hit by a bus. Or disappears. Or goes rogue.

In system design, be it a network, application or any other piece of automated infrastructure, we eschew single points of failure.  We know that in the real world, things go wrong, and consequently we design systems that have redundancy built in.  If something fails, the system can transfer operations over to a redundant subsystem, and keep on going.  That’s why, for example, websites have back-up load balancers or data servers.  These apparently redundant elements are there to keep the system online if the primary subsystem breaks.
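As a rough illustration of the failover idea described above, here is a minimal Python sketch. The names `call_with_failover`, `primary` and `backup` are hypothetical, made up for this example, and the "servers" are just functions standing in for real subsystems. It assumes that a failed subsystem signals its failure by raising an exception.

```python
from typing import Callable, List, Sequence


def call_with_failover(backends: Sequence[Callable[[], str]]) -> str:
    """Try each backend in order and return the first successful result."""
    errors: List[Exception] = []
    for backend in backends:
        try:
            return backend()
        except Exception as exc:  # assumption: any exception means "this one is down"
            errors.append(exc)
    # Every redundant subsystem failed, so there is nothing left to fall back on.
    raise RuntimeError(f"all {len(backends)} backends failed: {errors}")


def primary() -> str:
    # Hypothetical primary subsystem that is currently broken.
    raise ConnectionError("primary data server is down")


def backup() -> str:
    # Hypothetical redundant subsystem that takes over.
    return "served by backup"


result = call_with_failover([primary, backup])
print(result)
```

The caller never needs to know which subsystem answered, which is the point of building the redundancy in.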

Unfortunately, in the I.T. world, it is common to forget that the human operators of automated systems are effectively part of those systems. They can also constitute a single point of failure, and we should avoid this problem with humans as well as machines, for the same reasons.

No single operator (be they employee or principal) should ever control exclusive passwords or knowledge about a critical system. To do so makes the system fragile, and sets it up for the kind of snafus that are currently plaguing San Francisco.
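One way to make this rule checkable is to keep a record of who holds credentials or working knowledge for each critical system and flag anything with only one holder. The sketch below is a minimal Python example under that assumption. The function name `single_points_of_failure`, the system names and the people are all hypothetical, and a real inventory would come from wherever access is actually recorded.

```python
from typing import Dict, Set


def single_points_of_failure(access: Dict[str, Set[str]]) -> Dict[str, str]:
    """Return the systems that exactly one person can operate, with that person's name."""
    return {
        system: next(iter(holders))
        for system, holders in access.items()
        if len(holders) == 1
    }


# Hypothetical inventory: system name -> people who hold its passwords or know how to run it.
access = {
    "core-router": {"alice"},
    "web-frontend": {"alice", "bob"},
    "billing-db": {"carol"},
}

risky = single_points_of_failure(access)
for system, person in sorted(risky.items()):
    print(f"{system} fails the Bus Test: only {person} can operate it")
```

Anything this reports is a system that stops being maintainable the day its one operator is unavailable.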

The linked article describes how the network administrator in question was unwilling to let anyone else work on the network, because he felt they were incompetent and the network's configuration was extremely complicated. I can’t help thinking that centralizing control over the network was a band-aid on a tumor, however. Essentially, the city was running a network that was too complicated for it to staff properly, and was relying on a bad management decision to cover for that.

In effect, this is accepting risk in exchange for lower cost. However, it raises the question of whether it was an acceptable risk, and whether the decision was made consciously or just happened through the ignorance of the network administrator’s supervisors. I’m betting the latter. This was sweeping the problem under the carpet, and the city of San Francisco is now paying the cost. Bad policy.


Posted in commentary.



2 Responses


  1. Lon says

    The bus test is one that comes up in regular conversation around my office. Designing a system that passes the test is not that hard, once people keep the idea in mind. Which is easy enough when the boss (me) says it ad nauseam (I do).

    Now, the harder part is getting system architects to think about a higher-order bus test. That is, not whether the system can run after key people are bus-ed (ouch), but whether software maintenance can still occur. This is about making sure the knowledge needed for maintenance and code expansion survives.

    A simple way to wrap one’s head around the problem is by asking “what would it take to replace a developer?” For instance, suppose a system is based on a key technology that almost no one knows, and there is only one guy in the company who knows it. If he gets bus-ed (again, ouch), the system may not crash that moment, but it has transformed into a ticking time bomb.

    This is a problem most companies face in legacy systems that use legacy technologies.

  2. loren says

    @Lon I like how you frame the idea as a “higher-order” bus test. I think you may even be treading on a deeper issue: too often systems are treated as one-time investments that are allowed to fester, rather than the builders accepting responsibility for ongoing maintenance of an automated business process. Business processes evolve and require people to keep them alive. Automation doesn’t change that basic fact.