
Outages reveal ownership, not just technical weakness
The ongoing #AWS us-east-1 outage also uncovers organizational and mindset problems well beyond the cloud provider's services.
Blame-shifting, avoiding responsibility, not owning the problem and covering one's behind are evergreen topics. Even cloud services are developed, tested and operated mostly by humans, and we humans make mistakes. Outages will therefore happen even when no equipment fails in interesting ways and there is no disaster or attack.
Some people and companies tend to point a finger at the provider, which is unfortunate behavior that tends to lower the trust and loyalty of their customers. Mature companies and the people in them own the problem instead: they apologize for the disruption, put their best people on it and present concrete next steps, or at least a specific time when they will give the next status update.
After the problem has been solved, you can explain how the #outage fits into your risk management, e.g. that your Service-Level Objective targets 99.9 % availability (“three nines”) over the course of a year. For a free, nice-to-have service that might be perceived as generous; for a commercial service it might be on the lower end, a case of “you get what you pay for”. Big corporations usually want a Service-Level Agreement under which you pay a fine or give a discount if you miss the target. You have to manage expectations, ideally before an outage.
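To make the three-nines figure concrete, here is a minimal Python sketch of the downtime budget that an availability target implies. The function name `downtime_budget` and the 365-day year are assumptions chosen for illustration, not part of any standard tooling.

```python
from datetime import timedelta


def downtime_budget(slo_percent: float,
                    period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the maximum downtime an availability SLO allows over a period.

    Hypothetical helper for illustration; assumes a 365-day year by default.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    # The budget is the fraction of the period that may be unavailable.
    return period * (1 - slo_percent / 100)


# Compare a few common availability targets over one year.
for slo in (99.0, 99.9, 99.99):
    print(f"{slo} % availability allows {downtime_budget(slo)} of downtime per year")
```

Three nines over a 365-day year works out to roughly 8.76 hours of allowed downtime, so a single outage lasting several hours can consume most of the annual budget.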
In reality though, from the business point of view, the affected parties care less about why you failed than about how it affected their business and what it means for the future. Will you engineer for higher availability? Will you have better fallbacks? Will you compensate even if you don't have to?
❗If you are the one delegating responsibility, you are ultimately still responsible because you've chosen the conditions under which you delegate and to whom.
And you don't just delegate responsibility when you rent something or buy a service, but also when you buy a product, with or without a support agreement. Software and everything else that you cannot change yourself relies on somebody else's responsible behavior.
Many businesses delegate in this way somewhat blindly, driven by “that's what everybody does” thinking and other biases. That's why so many companies are affected by the us-east-1 outage. Take this opportunity to observe which companies shift blame even when they could have engineered their services better or could own up to the fact that they target lower SLOs. Those businesses will tend to overpromise and underdeliver, and their valuation might be overblown.