A lot of people are freaking out this morning because Amazon’s S3 storage service went down. Even I was affected for a short time and in a tiny way: I couldn’t post images to my tumblelog, because Tumblr stores a lot of its data on S3, as does Twitter. The domino effect is in full force here: panicky business people are wailing on Amazon’s forums like Othello: “My reputation, my reputation’s gone!” And why are they wailing? Because their customers are outraged (outraged!) that there should be any interruption in service. Last month the folks at 37signals had the same kind of wrath turned on them when their servers lost their network connection for two hours. Two hours, but people complained as though it were two weeks.
Nicholas Carr writes, sternly, “Given that entire businesses run on S3 and related services, Amazon has a particularly heavy responsibility not only to fix the problem quickly but to explain it fully.” And I guess that’s true, but you know, these things are bound to happen sometimes. About a week after the 37signals outage, but without reference to it, Joel Spolsky wrote,
Really high availability becomes extremely costly. The proverbial "six nines" availability (99.9999% uptime) means no more than 30 seconds downtime per year. That's really kind of ridiculous. Even the people who claim that they have built some big multi-million dollar superduper ultra-redundant six nines system are gonna wake up one day, I don't know when, but they will, and something completely unusual will have gone wrong in a completely unexpected way, three EMP bombs, one at each data center, and they'll smack their heads and have fourteen days of outage.
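To put rough numbers on that, here is a minimal sketch of the arithmetic behind "nines." The function names are mine, invented for illustration, and I'm assuming a plain 365-day year. By this reckoning six nines allows about 31.5 seconds of downtime a year, which is where the "30 seconds" figure comes from, and a single fourteen-day outage drags a year's availability down to roughly 96%.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year, no leap days


def downtime_budget_seconds(nines: int) -> float:
    """Seconds of downtime per year permitted by an availability of N nines."""
    unavailability = 10 ** -nines  # six nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


def availability_after_outage(outage_seconds: float) -> float:
    """Fraction of the year a service was up, given its total outage time."""
    return 1 - outage_seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    # Downtime budget for two through six nines.
    for n in range(2, 7):
        print(f"{n} nines: {downtime_budget_seconds(n):,.1f} seconds per year")

    # The hypothetical fourteen-day outage from the quote above.
    fourteen_days = 14 * 24 * 60 * 60
    print(f"14-day outage: {availability_after_outage(fourteen_days):.2%} availability")
```

Each extra nine cuts the allowance by a factor of ten, which is a good part of why the cost climbs so steeply.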
As Spolsky notes, quoting Nassim Nicholas Taleb, there are always going to be black swans, the “unexpected unexpecteds,” and while people affected negatively by these events will always, always, say, “You should have known, you should have prepared for this,” this is just what Gary Saul Morson calls backshadowing: “foreshadowing after the fact.” Everything looks obvious in retrospect.
This is true in many walks of life. Malcolm Gladwell wrote a nice essay in the New Yorker a few years ago showing that most critiques of the failures of intelligence operations are forms of backshadowing (though he doesn’t use the term). In The Right Stuff, Tom Wolfe points out how, in the test-pilot days that preceded the space program, every time a pilot bought the farm the other pilots would gather for drinks and talk about how stupid he was, how “he should have known” to do this, or not to do that.
But shit happens, it really does. If you’re pissed off that a web service is down for a couple of hours, then keep all your data on your own computer. But computers crash, don’t they, and hard drives fail. So write everything on paper — but paper gets lost or coffee is spilled all over it. Cell phones get stolen or their batteries die. There are no fail-safe systems. Given the complexity and scope of what they do, the uptime of services like Amazon’s S3 or 37signals’s products is remarkable — the uptime of the whole freaking Internet is remarkable. People need to get a grip.