While sites like Reddit and Foursquare experienced nearly a day of downtime and are still having issues, our hosting clients experienced a total of 11 minutes of downtime related to the Amazon issues. There was an additional 15 minutes of total downtime in the middle of the night while engineers worked to ensure continued stability in the midst of hardware failures. The one site affected had only a single database server; no site with a pair of replicated database servers experienced any downtime at all.
Enterprise levels of uptime and performance come from the design and support, not the hardware. The system at the heart of Amazon's failures was their network virtual disk product, EBS. The sites that experienced substantial downtime all used that storage for the main hard drives of their systems without RAID. That is poor design having nothing to do with the cloud: it puts a single point of failure across your entire environment. Our environment assumes every system will fail at some point and works to mitigate that as much as possible. We do use the EBS product, but only in the context of RAID, and never for the operating system or core services of a machine. The machines in our enterprise systems share no single points of failure, so the loss of one or more machines, or even a full availability zone, does not pose a major problem.
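As a rough illustration of the RAID-over-EBS approach described above (the device names, array level, and mount point here are assumptions for the example, not our actual configuration), an array spanning several EBS volumes can be assembled with mdadm:

```shell
# Sketch only: assumes four equal-size EBS volumes have already been
# attached to the instance as /dev/xvdf through /dev/xvdi.

# Build a RAID-10 array across the four volumes so that no single
# volume failure (or stalled EBS I/O) takes the data store down.
mdadm --create /dev/md0 --level=10 --raid-devices=4 \
      /dev/xvdf /dev/xvdg /dev/xvdh /dev/xvdi

# Put only application data on the array; the OS and core services
# stay on storage unaffected by EBS problems.
mkfs.ext4 /dev/md0
mkdir -p /data
mount /dev/md0 /data

# Persist the array definition so it reassembles on reboot.
mdadm --detail --scan >> /etc/mdadm.conf
```

The point of the sketch is the separation of concerns: EBS volumes are treated as disposable members of a redundant array, never as the lone disk a machine depends on.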
Failures of this nature are going to happen. I don't care if you host on Amazon, Rackspace, Microsoft, a traditional datacenter, or the server under your desk: failures happen. Core routers die, backhoes cut cables, systems guaranteed to always be up go down. The key is to design your environment with ways to deal with every possibility you can think of, and to have engineers with the experience and creativity to deal with the other million things you didn't.
Amazon EC2 outage, not for us