Amazon was recently in the news for being down for two hours. Speculation about the cause is running rampant on CNET:
“It doesn’t seem to be the result of a network-initiated attack, at least from my preliminary analysis from our probes,” Ranjan said.
Human error may not sound as gripping a tale as a network attack, but there’s plenty of drama for the people responsible. And it’s the career-limiting variety of drama, said Illuminata analyst Gordon Haff, who hazarded a guess that Amazon’s problem involved its front-end Web servers.
The security group of WebSense, a Web site and communications protection company, also saw no evidence Amazon’s problem was security related.
Having talked to a lot of Amazon people here after my arrival in Seattle, I’m surprised that they don’t have more downtime. Amazon is run like a huge basement operation.
Let me explain.
Amazon doesn’t have a real operational staff. They have developers who code up releases by day and then have to handle first-line response to outages and incidents by night.
As far as I can tell, they have no industry-standard monitoring software, no configuration management platform, and not even a centralized policy framework. They leave it up to each business unit to develop its own infrastructure and systems management strategy. Best yet, it’s all run by developers.
I think everyone reading this who has been a pro in running operational systems just recoiled in horror after that last sentence.
I understand that entrepreneurial environments want to be as nonconforming and iconoclastic as possible, to “think outside the box” or take whatever in-your-face, anti-status-quo stance encourages innovation, but don’t bring that Kool-Aid into the harsh realm of uptime.
Stabilizing operational systems by standardizing their build process, applying quality assurance to code deployments, and hiring operational staff so that the work doesn’t tax your architectural staff not only leads to better performance, but also takes your staff out from under the Sword of Damocles of downtime. Having to choose between stability and innovation is a poor choice to make when a bit of operational sanity gives you both, along with cost savings.