I couldn't agree more. Maintaining that level of availability requires an exceptional level of change control discipline, peer review, and fault tolerance. It adds cost and makes system maintenance, upgrades, and changes a huge pain in the a$$. If Samsung outsourced this, you're also correct that the SLA breach (depending on the contract terms) would likely carry a financial penalty.
I use five nines as an extreme example, but also point out that Google was only showing availability on the order of four nines.
Had a single service been affected, I wouldn't have thought much about it. Had everything been down for an hour or so, it would have been a major outage, but I'd reasonably give them that one. It's the fact that everything was offline for half a day that I found to be a bit striking for a company that markets products that advertise a number of those services as core features.
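To put rough numbers on the nines and on the outage, here is a quick back-of-the-envelope sketch. It assumes "half a day" means 12 hours, a 365-day year, and that this was the only downtime in that year. The function names are made up for illustration and none of this reflects Samsung's actual targets.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_budget_minutes(nines: int) -> float:
    """Yearly downtime allowed at N nines of availability (5 -> 99.999%)."""
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


def availability_after_outage(outage_minutes: float) -> float:
    """Yearly availability if this one outage is the only downtime (assumption)."""
    return 1 - outage_minutes / MINUTES_PER_YEAR


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: {downtime_budget_minutes(n):.2f} minutes of downtime per year")

    half_day = 12 * 60  # assumed length of the outage, in minutes
    print(f"12-hour outage: {availability_after_outage(half_day):.4%} availability for the year")
```

Five nines allows a bit over 5 minutes of downtime a year and four nines a bit under an hour. A single 12-hour outage uses up more than a decade of a four-nines budget and leaves the year at about 99.86%, which is between two and three nines.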
Maybe a post-mortem will cause Samsung to revise their DR strategy, maybe it won't. Without knowing what their service level targets are, who could say? Perhaps they regard the services offered as purely for entertainment purposes and consider the outage to be perfectly acceptable, as many on this forum do. Hopefully that strategy will never be tested again, especially by a life-threatening disaster.
My bottom-line point is that this outage demonstrated what a service loss at Samsung could look like. Samsung promotes their Pro line tablets as business tools. They never stated what kind of reliability comes with the backend services that accompany those business tools, but now we know and can plan accordingly. If you could find yourself in a position where loss of access for half a day would put you in a bind, it is now apparent that alternative solutions need to be considered. Samsung may never have another outage of this sort again, but they aren't publishing an SLA to their customers, so historical data is all we can use, and this event was historically significant. I guess I'm the only one who was surprised or even bothered by it, which is fine. In that case, Samsung is doing exactly as they should: spending the minimum amount of money to provide a service level that just meets the demands of the majority of their customer base.
XDA members represent a small subset of the user base and one that I would consider to be more demanding of their devices than the average user. This thread has yielded some really interesting perspectives on what people consider to be appropriate service levels for consumer services in modern times. I'm obviously an outlier on the high-availability side, but what would have been the other end of that spectrum? A full day? Two? A week? Just curious...
Maybe my clients are spending way too much on availability because they assume average consumers are more demanding than they really are. Restoring from backups to a DR site is certainly cheaper than maintaining multiple active datacenters with near-real-time data replication across a distance large enough to survive a natural disaster.