A failure of Amazonian proportions…
It’s often said that you only know the real value of something when it’s gone. That’s a feeling that must have been shared the world over last week, when some of the biggest online services suddenly stopped working.
The problem was a sudden and catastrophic failure at Amazon Web Services. While many people think of Amazon only as a purveyor of online shopping and downloadable media, its vast infrastructure actually powers huge chunks of the internet.
Amazon’s S3 storage service provides the big buckets that major brands can shove their vast amounts of fast-moving data into. This is true not just for their websites, but for thousands of the world’s most popular apps and internet-connected devices.
The sheer scale of the Amazon cloud makes it very cost-effective – until you have to count the cost of failing to build some redundancy into the system!
When S3 failed, even Amazon apparently couldn’t update its own service status dashboard to keep customers informed, and for a few hours social media and the traditional news media went into an Amazon-bashing frenzy.
As more and more of every business’s work moves onto cloud-connected services, this is a timely reminder of why it’s so important to get the details right.
Many old adages come into their own here, but perhaps the most obvious is the one about not putting all of your eggs in one basket…
To be fair to Amazon, its infrastructure is divided into geographic regions that operate largely independently, and in this case it didn’t all fail. The company recommends that users spread the load across its regions to guard against exactly this kind of event. The trouble is that, for a big project, building in that kind of redundancy adds complexity and cost for developers and the people who pay them, so it often doesn’t happen, especially when a product or service is being rushed onto the market.
There’s every chance you were affected in one way or another, whether because your business systems (accounting, HR, sales, etc.) store their data on Amazon S3 or because you were out for a run and trying to update your Strava account!
While the consequences of this massive failure are fresh in your mind, why not take the opportunity to check how many of your critical systems are vulnerable to a single point of failure? When you find an issue, there’s always a fix, and duplicating or safeguarding crucial data and systems doesn’t have to mean doubling your investment in them. Being prepared brings wonderful peace of mind – and avoids embarrassing fallout when a data centre half a world away lets you down.