Lessons Learned From a Recent Amazon Outage

By Gregory Machler, originally posted at Computerworld.com

Another Amazon cloud-services outage occurred on Sunday, August 7th, at a data center in Dublin, Ireland. This one was caused by a lightning strike that hit a transformer near the facility, triggering an explosion and fire that knocked out all utility services and took the entire data center offline. Amazon's only European data center was located there.

My initial thoughts turn to disaster recovery and Amazon's services. In their last significant outage, in April, a network configuration change took down services in the eastern United States. This outage raises other questions. Why isn't Amazon deploying a redundant power source, such as diesel-powered backup generators? Perhaps they did, and the fire took out part of that backup capacity as well, so a more serious disaster grew out of the initial transformer explosion.

How could this be addressed? How about fail-over to services in another geographic location in Europe? That didn't happen. I can only guess that building out another data center is cost-prohibitive at this time, which is why Amazon doesn't have a second European data center. The rest of the article mentions that it will take Amazon up to two more days to bring the remaining servers back up.
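To illustrate the kind of geographic fail-over being asked for, here is a minimal sketch in Python of a client that health-checks a primary European endpoint and falls back to a secondary region. The endpoint URLs and the /health path are hypothetical placeholders, not Amazon's actual API.

```python
import urllib.request

# Hypothetical regional endpoints; a real provider's service URLs will differ.
REGIONS = [
    "https://eu-dublin.example-cloud.com",     # primary (Dublin)
    "https://eu-secondary.example-cloud.com",  # hypothetical second European region
]

def is_healthy(base_url, timeout=2):
    """Return True if the region answers its health-check URL quickly."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def pick_region():
    """Route to the first healthy region; raise if every region is down."""
    for base_url in REGIONS:
        if is_healthy(base_url):
            return base_url
    raise RuntimeError("No healthy region available")

if __name__ == "__main__":
    print("Routing traffic to:", pick_region())
```

In practice this decision is usually made by DNS or a load balancer rather than by each client, but the principle is the same: a second region must exist before anything can fail over to it.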

It mentions that starting all of the servers back up is taking a significant amount of time. It also states that Microsoft, which has services in the same data center, does not share the same weakness. I wonder why that is; data replication should be a high priority, especially when Amazon lacks full-scale data center disaster recovery.
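To make the replication point concrete, here is a minimal sketch, again in Python, of asynchronously copying each write to a second region so data survives the loss of the primary. The store interface is hypothetical and stands in for whatever storage API a provider actually exposes.

```python
import queue
import threading

class ReplicatedStore:
    """Write locally, then copy each object to a second region in the background."""

    def __init__(self, primary, secondary):
        self.primary = primary        # e.g. a dict-like store in Dublin
        self.secondary = secondary    # a dict-like store in another region
        self._pending = queue.Queue()
        threading.Thread(target=self._replicate, daemon=True).start()

    def put(self, key, value):
        self.primary[key] = value        # acknowledge the write immediately
        self._pending.put((key, value))  # replicate asynchronously

    def _replicate(self):
        while True:
            key, value = self._pending.get()
            self.secondary[key] = value  # in practice: a network call to the other region

# Usage: if the primary region is lost, the secondary still holds the replicated data.
store = ReplicatedStore(primary={}, secondary={})
store.put("invoice-42", b"...")
```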

On Monday, August 8th, Amazon mentioned that a software error was slowing the recovery of data within the European data center. This points to another gap: a lack of business-continuity testing. Such testing is necessary precisely because conditions like these occur rarely. It also underscores that complex configurations make it hard to test the various failure scenarios. Deploying and testing only a minimal number of application configurations is realistic; otherwise there are too many permutations to test, as the sketch below illustrates. See a previous disaster recovery article, which argues that products should have standard configurations, much as a car engine configuration is tied to the car model.
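A quick back-of-the-envelope calculation shows how fast the permutations get out of hand; the configuration dimensions and option counts below are made up for illustration.

```python
from math import prod

# Hypothetical configuration dimensions for one application stack.
options = {
    "os_version": 3,
    "web_server": 2,
    "app_framework": 4,
    "database": 3,
    "storage_tier": 3,
}

combinations = prod(options.values())
print(f"{combinations} distinct configurations to test")  # 216

# Standardizing on one choice per dimension leaves a single configuration,
# which is the only realistic target for regular disaster-recovery drills.
```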

So it looks as if more Amazon cloud-services weaknesses are surfacing under operational stress. How can mid-sized and small businesses that outsource their web applications to Amazon's cloud protect themselves? It's clear that Amazon supports cloud applications where it is profitable to do so. I suggest that those firms create a very detailed, per-application SLA (Service Level Agreement) that specifies global up-time, performance, and penalties when the service isn't meeting objectives.
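One way to make such an SLA checkable is to express it as data and compute the penalty from measured up-time. The sketch below is a hypothetical illustration; the application name, targets, and penalty figures are placeholders, not any provider's actual terms.

```python
from dataclasses import dataclass

@dataclass
class ApplicationSLA:
    name: str
    uptime_target: float           # e.g. 0.9995 == 99.95% per month
    max_response_ms: int           # performance objective
    penalty_per_tenth_pct: float   # credit owed per 0.1% of missed up-time

    def penalty(self, measured_uptime: float) -> float:
        """Service credit owed for the month; zero if the target was met."""
        shortfall = max(0.0, self.uptime_target - measured_uptime)
        return round((shortfall / 0.001) * self.penalty_per_tenth_pct, 2)

# Placeholder terms for one outsourced web application.
sla = ApplicationSLA("storefront", uptime_target=0.9995,
                     max_response_ms=500, penalty_per_tenth_pct=250.0)

# A month with 99.80% measured up-time misses the target by 0.15%.
print(sla.penalty(0.9980))  # 375.0 (credit in the contract's currency)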

In my last couple of articles, I outlined questions to ask the service provider that reveal a current application's architecture. These questions can be asked of every application that a company wants a cloud provider to manage. That information, combined with up-time requirements and performance statistics, forms the SLA.

It is likely that Amazon and other major cloud providers will not support extensive disaster recovery plans until SLAs penalize them into delivering that service well. Well-defined SLAs lead to global trade growth because they ensure that business runs well globally. This business handshake builds trust between the two parties. And we all know that 'Trust is Trade.'
