I keep thinking about a conversation that I had this past December. The person I was talking to described people who ran their servers and infrastructure well as never having existed or, if they did, as no longer being around. They called them “old school legendary ninjas.”
They:
- Ran stable systems with high uptime.
- Logged events centrally and paged/emailed themselves on things that were relevant to preserving the integrity of the service before things failed.
- Monitored for uptime, service availability, and weirdness.
- Understood each component of the systems that they created and maintained.
I know that they were around, because I was one of them. We talked to each other in a variety of nerd-only ways, most of which are no longer in common use today.
I viewed the list above as the base standard for competence in the role of technology manager, whatever title people have picked for it. It blew my mind that people would sit around and wait for a telco (or worse, their customers) to notice that their leased line or service was unavailable before springing into action to correct the problem. I had the direct contact numbers for all of the top technical support tiers that serviced me in my role. When problems arose, I called them so that things would work correctly, hopefully without anyone else noticing that an outage had taken place. The telco frame and private lines performed poorly (and often still do), and the telcos never fixed anything the right way. Since I knew from past experience where the routinely broken pieces were to be found, it was faster for me to call and troubleshoot the issue with them than to leave them to their own processes.
Knowing how everything worked was not an exceptional practice. This was normal.
Now everything is going to big data cloud environments. ISPs either need to have a cloud offering or are looking to close up shop and retire. For the mass market, this is where demand has gone.
Perhaps not so surprisingly, standards of service are nothing like they used to be. Outage windows can appear in the middle of the day and stretch for hours. Data providers offer unstable links and create outages seemingly at random, because the people going about their routine activities don’t know or don’t care whether they are causing problems.
The “who cares, it’s good enough for our partners and customers not to fire us” business sufficiency principle is in effect. They’re not endeavoring to be better than their peers; they just want to avoid sucking more than them, so that there is nowhere else to go.
I don’t think many realize how common departing the business really is. A lot of people leave and do one of two things:
- Open a bar, restaurant, or other totally predictable and reliable business model (most commonly referenced example: jwz)
- Get involved in the next hyper-valuation bubble (almost everyone with a current social media startup)
None of this tolerance for faulty and unreliable service offerings was acceptable then, and it shouldn’t be now. Most of our enterprises adopt hyper-complex service offerings that later become untenable because of their layered nature and their artifact-bloated old code and dependencies. The agitators and those wanting to effect good change in their organizations need to be ready to confront these problems head-on and to smash these problematic stacks and silos Shock Doctrine style when the next crisis arrives. You won’t have to wait long for your next catastrophe. If your ability to detect breaches and downtime is functional, and you have the necessary metrics to capture and quantify them, your way out may be just around the corner.
There are plenty of excuses for an inability to execute change effectively, and that inability is often the real reason things go badly. I would characterize the real cause as the organization’s inability or unwillingness to adapt and adopt change effectively, even though it is a lot easier to blame something else for incidents and calamity after the fact.
It’s not like we don’t know what is wrong, what to fix, or how to fix it. It’s really a question of many making an active decision that service quality is not worth the investment.
One thing that millennials may add to the business climate when they come of age and take the reins of leadership is a tipping point of people who don’t try to separate business risk from technology risk and understand them to be one and the same.
Hopefully we won’t have to wait a couple decades more for this to happen.