outage and outrage
A lot of people are freaking out this morning because Amazon’s S3 storage service went down. Even I was affected for a short time and in a tiny way: I couldn’t post images to my tumblelog, because Tumblr stores a lot of their data on S3, as does Twitter. The domino effect is in full force here: panicky business people are wailing on Amazon’s forums like Othello: “My reputation, my reputation’s gone!” And why are they wailing? Because their customers are outraged (outraged!) that there should be any interruption in service. Last month the folks at 37signals got the same kind of wrath turned on them when their servers lost network connection for two hours. Two hours, but people complained as though it were two weeks.
Nicholas Carr writes, sternly, “Given that entire businesses run on S3 and related services, Amazon has a particularly heavy responsibility not only to fix the problem quickly but to explain it fully.” And I guess that’s true, but you know, these things are bound to happen sometimes. About a week after the 37signals outage, but without reference to it, Joel Spolsky wrote,
Really high availability becomes extremely costly. The proverbial "six nines" availability (99.9999% uptime) means no more than 30 seconds downtime per year. That's really kind of ridiculous. Even the people who claim that they have built some big multi-million dollar superduper ultra-redundant six nines system are gonna wake up one day, I don't know when, but they will, and something completely unusual will have gone wrong in a completely unexpected way, three EMP bombs, one at each data center, and they'll smack their heads and have fourteen days of outage.
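To put rough numbers on Spolsky's point, here is a minimal back-of-the-envelope sketch. The function names are my own, and I'm assuming a 365.25-day year, which is why six nines comes out at about 31.5 seconds rather than his rounded 30.

```python
# Back-of-the-envelope availability arithmetic.
# Assumption: a year is 365.25 days; all names here are illustrative.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for an availability of N nines.

    Two nines means 99%, three nines means 99.9%, and so on.
    """
    unavailability = 10.0 ** -nines
    return SECONDS_PER_YEAR * unavailability


def years_of_budget_used(outage_seconds: float, nines: int) -> float:
    """How many years of downtime allowance a single outage consumes."""
    return outage_seconds / downtime_seconds_per_year(nines)


if __name__ == "__main__":
    for n in range(2, 7):
        print(f"{n} nines: {downtime_seconds_per_year(n):,.1f} seconds per year")

    # Spolsky's hypothetical fourteen-day outage, measured against six nines.
    fourteen_days = 14 * 24 * 60 * 60
    years = years_of_budget_used(fourteen_days, 6)
    print(f"A 14-day outage uses about {years:,.0f} years of six-nines budget")
```

By this arithmetic, one fourteen-day outage eats roughly 38,000 years of a six-nines allowance, and a two-hour outage eats a couple of centuries' worth.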
As Spolsky notes, quoting Nassim Nicholas Taleb, there are always going to be black swans, the “unexpected unexpecteds,” and while people affected negatively by these events will always, always say, “You should have known, you should have prepared for this,” this is just what Gary Saul Morson calls backshadowing: “foreshadowing after the fact.” Everything looks obvious in retrospect.
This is true in many walks of life. Malcolm Gladwell wrote a nice essay in the New Yorker a few years ago showing that most critiques of the failures of intelligence operations are forms of backshadowing (though he doesn’t use the term). In The Right Stuff, Tom Wolfe points out how, in the test-pilot days that preceded the space program, every time a pilot bought the farm the other pilots would gather for drinks and talk about how stupid he was, how “he should have known” to do this, or not to do that.
But shit happens, it really does. If you’re pissed off that a web service is down for a couple of hours, then keep all your data on your own computer. But computers crash, don’t they, and hard drives fail. So write everything on paper — but paper gets lost or coffee is spilled all over it. Cell phones get stolen or their batteries die. There are no fail-safe systems. Given the complexity and scope of what they do, the uptime of services like Amazon’s S3 or 37signals’s products is remarkable — the uptime of the whole freaking Internet is remarkable. People need to get a grip.
Maybe people would be more understanding if they just remembered their Rumsfeld:
The Unknown
As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don’t know
We don’t know.
http://www.slate.com/id/2081042/
— Ross G. · Feb 15, 07:32 PM · #
Reading through the forum it sounds as if the issue was as much the lack of communication coming from Amazon as it was the downtime itself. When you’re running a mission critical service, you need to have a strategy for communicating the details of any downtime that (inevitably) does occur. It’s ok for shit to happen, it’s not ok to let people wonder what the smell is for very long.
— TW Andrews · Feb 15, 09:37 PM · #
You can have any level of high availability you want (as long as it’s less than 100%) if you are willing to pay (and wait) for it. A good example of a highly available system is the traditional US public telephone network. It was designed for robustness and has always been a lot more reliable than the internet. (I am told that on 9/11 there was a telephone network center inside the WTC that kept running even after the towers fell.) However, this public telephone equipment is very expensive and very hard to modify. It could never develop as quickly and flexibly as the internet has.
The existence of black swans means that you can never have 100% reliability, but you can still get just about as many 9s as you want – precisely because they are very rare, the outages black swans cause don’t pull down the overall average much.
— JimB · Feb 15, 10:49 PM · #
Agreed that you can get as many 9s as you want if you are willing to spend the money.
But Gladwell is predictably WRONG on all counts about Intelligence, and he is axe grinding. Larry Johnson, in a famous July 2001 article, “The Declining Terrorist Threat,” wrote that there was zilch possibility of a mass casualty terrorist attack on the US.
Intelligence is the collection and analysis of foreign regimes and threats. Not random acts. 9/11 was not random like a hard drive or power supply failure, or the severing of fiber cable. Prior to 9/11 we had constant threats by AQ, various foiled US terrorist plots and many successful ones at home. Plus the WTC had been attacked before and there were constant threats to “Finish the job.”
Intel failures are usually the failure to see the obvious: the collapse of the USSR, continued AQ plots against the US, Saddam’s Kuwait invasion, etc. People deceive themselves on perceived risk and think non-random events are random. Because to think of it as non-random would require costly preventive methods.
Funny: WSJ had an article about six months ago about people not using Amazon back-end infrastructure precisely because they realized that downtime would indeed happen. Mission-critical stuff could not leverage Amazon, only extra stuff that could tolerate downtime.
— Jim Rockford · Feb 16, 08:10 AM · #